Course Contents

  1. 2022.12.07: Introduction: About the course [lead by TK]
    • An introduction to open and public data, and data science
  2. 2022-12-14: Exploratory Data Analysis (EDA) 1 [lead by hs]
    • R Basics with RStudio and/or RStudio.cloud; Toy Data
  3. 2022-12-21: Exploratory Data Analysis (EDA) 2 [lead by hs]
    • R Markdown, tidyverse I: dplyr; gapminder
  4. 2023-01-11: Exploratory Data Analysis (EDA) 3 [lead by hs]
    • tidyverseII: readr, ggplot2; Public Data, WDI, WIR, etc
  5. 2023-01-18: Exploratory Data Analysis (EDA) 4 [lead by hs]
    • tidyverse III: tidyr, etc.; WDI, WIR, etc
  6. 2023-01-25: Exploratory Data Analysis (EDA) 5 [lead by hs]
    • tidyverse IV; WDI, WIR, etc
  7. 2023-02-01: Introduction to PPDAC
    • Problem-Plan-Data-Analysis-Conclusion Cycle: [lead by TK]
  8. 2023-02-08: Model building I [lead by TK]
    • Collecting and visualizing data and Introduction to WDI
      (World Development Indicators by World Bank)
  9. 2023-02-15: Model building II [lead by TK]
    • Analyzing data and communications
  10. 2023-02-22: Project Presentation

1 Exploratory Data Analysis (EDA) I

1.1 R with R Studio and/or R Studio.cloud


1.1.1 Learning Resources

1.1.1.1 Textbooks and References

1.1.2 Interactive Exercises


1.1.3 Posit Primers created by learnr

1.1.3.1 Posit Primers https://posit.cloud/learn/primers

  1. The Basics – r4ds: Explore, I
  1. Work with Data – r4ds: Wrangle, I
  • Working with Tibbles
  • Isolating Data with dplyr
  • Deriving Information with dplyr
  1. Visualize Data – r4ds: Explore, II
  2. Tidy Your Data – r4ds: Wrangle, II
  3. Iterate – r4ds: Program
  4. Write Functions – r4ds: Program

1.1.4 Data Science and EDA

1.1.4.1 Wikipedia https://en.wikipedia.org/wiki/Data_science

An inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data.

  • Create Insights
  • Impact Decision Making
  • Maintain & Improve Overtime

1.1.5 What is R?

1.1.5.1 R (programming language), Wikipedia

  • R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing.

  • The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

  • A GNU package, the official R software environment is written primarily in C, Fortran, and R itself (thus, it is partially self-hosting) and is freely available under the GNU General Public License.


1.1.5.2 History of R and more

“R Programming for Data Science” by Roger Peng


1.1.6 Why R? – Responses by Hadley Wickham

1.1.6.1 r4ds: R is a great place to start your data science journey because

  • R is an environment designed from the ground up to support data science.
  • R is not just a programming language, but it is also an interactive environment for doing data science.
  • To support interaction, R is a much more flexible language than many of its peers.

1.1.6.2 Why R today?

When you talk about choosing programming languages, I always say you shouldn’t pick them based on technical merits, but rather pick them based on the community. And I think the R community is like really, really strong, vibrant, free, welcoming, and embraces a wide range of domains. So, if there are like people like you using R, then your life is going to be much easier. That’s the first reason.

Interview: “Advice to Young (and Old) Programmers, H. Wickham”


1.1.7 What is RStudio? https://posit.com

RStudio is an integrated development environment, or IDE, for R programming.

1.1.7.1 R Studio (Wikipedia)

RStudio is an integrated development environment (IDE) for R, a programming language for statistical computing and graphics. It is available in two formats: RStudio Desktop is a regular desktop application while RStudio Server runs on a remote server and allows accessing RStudio using a web browser.


1.1.8 Installation of R and R Studio

1.1.8.1 R Installation

To download R, go to CRAN, the comprehensive R archive network. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. Don’t try and pick a mirror that’s close to you: instead use the cloud mirror, https://cloud.r-project.org, which automatically figures it out for you.

A new major version of R comes out once a year, and there are 2-3 minor releases each year. It’s a good idea to update regularly.

1.1.8.2 R Studio Installation

Download and install it from http://www.rstudio.com/download.

RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know.


1.1.9 R Studio

1.1.9.1 The First Step

  1. Start R Studio Application
  2. Top Menu: File > New Project > New Directory > New Project > Directory name or Browse the directory and choose the parent directory you want to create the directory

1.1.9.2 When You Start the Project

  1. Go to the directory you created
  2. Double click _‘Directory Name’.Rproj

Or,

  1. Start R Studio
  2. File > Open Project (or choose from Recent Project)

In this way the working directory of the session is set to the project directory and R can search releted files without difficulty (getwd(), setwd())


1.1.10 Posit Cloud

RStudio Cloud is a lightweight, cloud-based solution that allows anyone to do, share, teach and learn data science online.

1.1.10.1 Cloud Free

  • Up to 15 projects total
  • 1 shared space (5 members and 10 projects max)
  • 15 project hours per month
  • Up to 1 GB RAM per project
  • Up to 1 CPU per project
  • Up to 1 hour background execution time

1.1.10.2 How to Start Posit Cloud

  1. Go to https://posit.cloud/
  2. Sign Up: top right
  • Email address or Google account
  1. New Project: Project Name
  2. R Console

1.2 Let’s Get Started

Start RStudio and create a project, or login to Posit Cloud and create a project.


1.2.1 The First Examples

Input the following codes into Console in the left bottom pane.

  • The first two:
head(cars)

str(cars)
'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

  • Two more:
summary(cars)
     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00  

plot(cars)


  • And three more:
plot(cars) # cars: Speed and Stopping Distances of Cars
abline(lm(cars$dist~cars$speed))


lm(cars$dist~cars$speed)

Call:
lm(formula = cars$dist ~ cars$speed)

Coefficients:
(Intercept)   cars$speed  
    -17.579        3.932  

summary(lm(cars$dist~cars$speed))

Call:
lm(formula = cars$dist ~ cars$speed)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
cars$speed    3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

1.2.1.1 Brief Explanation

  • head(cars): The first 6 rows of the pre-installed data cars.
  • str(cars): The data structure of the pre-installed data cars.
  • summary(cars): The summary of the pre-installed data cars.
  • plot(cars): A scatter plot of the pre-installed data cars.
    • plot(cars$dist~cars$speed)
    • cars$dist, cars$[[2]], cars[,2] are same
  • abline(lm(cars$dist~cars$speed)): Add a regression line of a linear model
  • lm(cars$dist~cars$speed): The equation of the regression line
  • summary(lm(cars$dist~cars$speed): The summary of the linear regression model

hist(cars$dist)


hist(cars$speed)


1.2.1.2 View and help

  • View(cars)
  • ?cars: same as help(cars)
  • ??cars: same as `help.search(“cars”)

1.2.1.3 datasets

  • ?datasets

  • library(help = "datasets")

  • data() shows all data already attached and available.


1.2.2 Practicum

Pick a data in the datasets package and try

  • head()
  • str()
  • summary()

and some more.


1.2.2.1 iris

head(iris)

str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Can you plot?

plot(iris$Sepal.Length, iris$Sepal.Width)

1.3 tidyverse Packages

1.3.1 Brief Introduction to R on RStudio

1.3.1.1 Four Panes and Tabs

  1. Top Left: Source Editor
  2. Top Right: Environment, History, etc.
  3. Bottom Left: Console, Terminal, Render, Background Jobs
  4. Bottom Right: Files, Plots, Packages, Help, Viewer, Presentation

1.3.1.2 Set up

  • Highly recommend to set the language to be “English”.
  • Create “data” directory.
Sys.setenv(LANG = "en")
dir.create("data")

1.3.1.3 Three Ways to Run Codes

  1. Console - Bottom Left Pane
  2. R Script - pull down menu under File
  3. R Notebook, R Markdown - pull down menu under File

1.3.2 Second Way: R Script

1.3.2.1 Examples: R Scripts in Moodle

  • basics.R
  • coronavirus.R
  1. Copy a script in Moodle: {file name}.R
  2. In RStudio (create Project in RStudio) choose File > New File > R Script and paste it.
  3. Choose File > Save with a name; e.g. {file names} (.R will be added automatically)

To run a code: at the cursor press Ctrl+Shift+Enter (Win) or Cmd+Shift+Enter (Mac).


1.3.3 Packages

R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R, typically via a centralised software repository such as CRAN (the Comprehensive R Archive Network).

1.3.3.1 Installation and attachement

You can install packages by “Install Packages…” under “Tool” in the top menu.

  • install.packages("tidyverse")
  • install.packages("rmarkdown")

1.3.4 Third Way: R Notebook

Choose R Notebook from the pull down File menu in the top bar.

1.3.5 Edit YAML

Default* is as follows

---
title: "R Notebook"
output: html_notebook
---

Template

---
title: "Title of R Notebook"
author: "ID and Your Name"
date: "2023-01-26" 
output: 
  html_notebook:
#    number_sections: yes
#    toc: true
#    toc_float: true
---
  • Don’t change the format. Indention matters.
  • The statement after # is ignored.
  • Date is automatically inserted when you compile the file.
  • You can replace “2023-01-26” by “2022-12-14” or in any date format surrounded by double quotation marks.
  • Section numbers: - default is number_sections: no.
  • Table of contents, toc: true - default is toc: false.
  • Floating table of contents in HTML output, toc_float: true - default is toc_float: false

1.3.6 Create a Code Chunk to Attach Packages

Insert Chunk in Code pull down menu in the top bar, or use the C button on top. You can use shortcut keys listed under Tools in the top bar.

library(tidyverse)

1.4 First Example

1.4.1 Importing data

Let us assign the iris data in the pre-installed package datasets to df_iris. You can give any name starting from an alphabet, though there are some rules.

df_iris <- datasets::iris
class(df_iris)
[1] "data.frame"

The class of data iris is data.frame, the basic data class of R. You can assign the same data as a tibble, the data class of tidyverse as follows.

tbl_iris <- as_tibble(datasets::iris)
class(tbl_iris)
[1] "tbl_df"     "tbl"        "data.frame"
  • df_iris <- iris can replace df_iris <- datasets::iris because the package datasets is installed and attached as default. Since you may have other data called iris included in a different package or you may have changed iris before, it is safer to specify the name of the package with the name of the data.
  • Within R Notebook or in Console, you may get different output, and tf_iris and tbl_iris behave differently. It is because of the default settings of R Markdown.

1.4.2 Look at the data

1.4.2.1 Several ways to view the data.

The View command open up a window to show the contents of the data and you can use the filter as well.

View(df_iris)

The following simple command also shows the data.

df_iris

The output within R Notebook is a tibble style. Try the same command in Console.


slice(df_iris, 1:10)
1:10
 [1]  1  2  3  4  5  6  7  8  9 10

1.5 `

1.5.0.1 Data Structure

Let us look at the structure of the data. You can try str(df_iris) on Console or by adding a code chunk in R Notebook introducing later.

glimpse(df_iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4,…
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9,…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4,…
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2,…
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, seto…

There are six types of data in R; Double, Integer, Character, Logical, Raw, Complex.

The names after $ are column names. If you call df_iris$Species, you have the Species column. Species is in the 5th collumn, typeof(df_iris[[5]]) does the same as the next.

df_iris[2,4] =

is the fourth entry of Sepal.Width.


typeof(df_iris$Species)
[1] "integer"
class(df_iris$Species)
[1] "factor"

For factors = fct see the R Document or an explanation in Factor in R: Categorical Variable & Continuous Variables.


typeof(df_iris$Sepal.Length)
[1] "double"
class(df_iris$Sepal.Length)
[1] "numeric"

Q1. What are the differences ofdf_iris, slice(df_iris, 1:10) and glimpse(df_iris) above?

Q2. What are the differences ofdf_iris, slice(df_iris, 1:10) and glimpse(df_iris) in the console?


1.5.0.2 Summary of the Data

The following is very convenient to get the summary information of a data.

summary(df_iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Minimum, 1st Quadrant (25%), Median, Mean, 3rd Quadrant (75%), Maximum, and the count of each factor.


1.5.1 Visualizing Data

1.5.1.1 Scatter Plot

We use ggplot to draw graphs. The scatter plot is a projection of data with two variables \(x\) and \(y\).

ggplot(data = <data>, aes(x = <column name for x>, y = <column name for y>)) +
  geom_point()
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()


1.5.1.2 Scatter Plot with Labels

Add title and labels adding labs().

ggplot(data = <data>, aes(x = <column name for x>, y = <column name for y>)) +
  geom_point() +
  labs(title = "Title", x = "Label for x", y = "Label for y")

ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() + 
  labs(title = "Scatter Plot of Sepal Data of Iris", x = "Sepal Length", y = "Sepal Width")


1.5.1.3 Scatter Plot with Colors

Add different colors automatically to each species. Can you see each group?

ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()


1.5.1.4 Scatter Plot with Shapes

ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width, shape = Species)) +
  geom_point()


1.5.1.5 Boxplot

The boxplot compactly displays the distribution of a continuous variable.

ggplot(data = df_iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()


1.5.1.6 Histogram

Visualize the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin. Histograms (geom_histogram()) display the counts with bars.

ggplot(data = df_iris, aes(x = Sepal.Length)) +
  geom_histogram()


Change the number of bins by bins = <number>.

ggplot(data = df_iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 10)


1.5.2 Data Modeling

Professor Kaizoji will cover the mathematical models and hypothesis testings.

ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

1.6 Comments on Week 2

1.6.0.2 Practicum


1.6.0.3 Assignments - See Moodle

  1. Assignment Week 2-1: Introduction Plus Forum
  • Due: Tuesday, 20 December 2022, 11:59 PM
  1. Assignment Week 2-2: Quiz 1 on R Basics
  • Due: Tuesday, 20 December 2022, 11:59 PM

1.7 Swirl: An interactive learning environment for R and statistics

1.7.1 Swirl Courses

  1. R Programming: The basics of programming in R
  2. Regression Models: The basics of regression modeling in R
  3. Statistical Inference: The basics of statistical inference in R
  4. Exploratory Data Analysis: The basics of exploring data in R

You can install other swirl courses as well


1.7.2 Install and Start Swirl Courses

1.7.2.1 Three Steps to Start Swirl

install.packages("swirl") # Only the first time.
library(swirl) # Everytime you start swirl
swirl() # Everytime you start or resume swirl

1.7.2.2 R Programming: The basics of programming in R

 1: Basic Building Blocks      2: Workspace and Files     3: Sequences of Numbers    
 4: Vectors                    5: Missing Values          6: Subsetting Vectors      
 7: Matrices and Data Frames   8: Logic                   9: Functions               
10: lapply and sapply         11: vapply and tapply      12: Looking at Data         
13: Simulation                14: Dates and Times        15: Base Graphics          

1.7.2.4 Controling a swirl Session

  • … <– That’s your cue to press Enter to continue

  • You can exit swirl and return to the R prompt (>) at any time by pressing the Esc key.

  • If you are already at the prompt, type bye() to exit and save your progress. When you exit properly, you’ll see a short message letting you know you’ve done so.

When you are at the R prompt (>):

  1. Typing skip() allows you to skip the current question.
  2. Typing play() lets you experiment with R on your own; swirl will ignore what you do…
  3. UNTIL you type nxt() which will regain swirl’s attention.
  4. Typing bye() causes swirl to exit. Your progress will be saved.
  5. Typing main() returns you to swirl’s main menu.
  6. Typing info() displays these options again.

1.7.2.5 Final Remark

You will encounter the message like ‘Would you like to receive credit for completing this course on Coursera.org?’ at the end of each course. This is for coursera courses. Select ‘NO’.

1.8 More on R Script: Examples

1.8.1 R Scripts in Moodle

  • basics.R
  • coronavirus.R
  1. Copy a script in Moodle: {file name}.R
  2. In RStudio (Workspace in RStudio.cloud, Project in RStudio) choose File > New File > R Script and paste it.
  3. Choose File > Save with a name; e.g. {file names} (.R will be added automatically)

1.8.2 basics.R

The script with the outputs.

#################
#
# basics.R
#
################
# 'Quick R' by DataCamp may be a handy reference: 
#     https://www.statmethods.net/management/index.html
# Cheat Sheet at RStudio: https://www.rstudio.com/resources/cheatsheets/
# Base R Cheat Sheet: https://github.com/rstudio/cheatsheets/raw/main/base-r.pdf
# To execute the line: Control + Enter (Window and Linux), Command + Enter (Mac)
## try your experiments on the console

## calculator

3 + 7

### +, -, *, /, ^ (or **), %%, %/%

3 + 10 / 2

3^2

2^3

2*2*2

### assignment: <-, (=, ->, assign()) 

x <- 5

x 

#### object_name <- value, '<-' shortcut: Alt (option) + '-' (hyphen or minus) 
#### Object names must start with a letter and can only contain letter, numbers, _ and .

this_is_a_long_name <- 5^3

this_is_a_long_name

char_name <- "What is your name?"

char_name

#### Use 'tab completion' and 'up arrow'

### ls(): list of all assignments

ls()
ls.str()

#### check Environment in the upper right pane

### (atomic) vectors

5:10

a <- seq(5,10)

a

b <- 5:10

identical(a,b)

seq(5,10,2) # same as seq(from = 5, to = 10, by = 2)

c1 <- seq(0,100, by = 10)

c2 <- seq(0,100, length.out = 10)

c1

c2

length(c1)

#### ? seq   ? length   ? identical

(die <- 1:6)

zero_one <- c(0,1) # same as 0:1

die + zero_one # c(1,2,3,4,5,6) + c(0,1). re-use

d1 <- rep(1:3,2) # repeat


d1

die == d1

d2 <- as.character(die == d1)

d2

d3 <- as.numeric(die == d1)

d3

### class() for class and typeof() for mode
### class of vectors: numeric, charcters, logical
### types of vectors: doubles, integers, characters, logicals (complex and raw)

typeof(d1); class(d1)

typeof(d2); class(d2)

typeof(d3); class(d3)

sqrt(2)

sqrt(2)^2

sqrt(2)^2 - 2

typeof(sqrt(2))

typeof(2)

typeof(2L)

5 == c(5)

length(5)

### Subsetting

(A_Z <- LETTERS)

A_F <- A_Z[1:6]

A_F

A_F[3]

A_F[c(3,5)]

large <- die > 3

large

even <- die %in% c(2,4,6)

even

A_F[large]

A_F[even]

A_F[die < 4]

### Compare df with df1 <- data.frame(number = die, alphabet = A_F)
df <- data.frame(number = die, alphabet = A_F, stringsAsFactors = FALSE)

df

df$number

df$alphabet

df[3,2]

df[4,1]

df[1]

class(df[1])

class(df[[1]])

identical(df[[1]], die)

identical(df[1],die)

####################
# The First Example
####################

plot(cars)

# Help

? cars

# cars is in the 'datasets' package

data()

# help(cars) does the same as ? cars
# You can use Help tab in the right bottom pane

help(plot)
? par

head(cars)

str(cars)

summary(cars)

x <- cars$speed
y <- cars$dist

min(x)
mean(x)
quantile(x)

plot(cars)

abline(lm(cars$dist ~ cars$speed))

summary(lm(cars$dist ~ cars$speed))

boxplot(cars)

hist(cars$speed)
hist(cars$dist)
hist(cars$dist, breaks = seq(0,120, 10))

1.8.3 coronavirus.R

The script and its outputs. coronavirus.csv is very large

# https://coronavirus.jhu.edu/map.html
# JHU Covid-19 global time series data
# See R pakage coronavirus at: https://github.com/RamiKrispin/coronavirus
# Data taken from: https://github.com/RamiKrispin/coronavirus/tree/master/csv
# Last Updated
Sys.Date()

## Download and read csv (comma separated value) file
coronavirus <- read.csv("https://github.com/RamiKrispin/coronavirus/raw/master/csv/coronavirus.csv")
# write.csv(coronavirus, "data/coronavirus.csv")

## Summaries and structures of the data
head(coronavirus)
str(coronavirus)
coronavirus$date <- as.Date(coronavirus$date)
str(coronavirus)

range(coronavirus$date)
unique(coronavirus$country)
unique(coronavirus$type)

## Set Country
COUNTRY <- "Japan"
df0 <- coronavirus[coronavirus$country == COUNTRY,]
head(df0)
tail(df0)
(pop <- df0$population[1])
df <- df0[c(1,6,7,13)]
str(df)
head(df)
### alternatively,
head(df0[c("date", "type", "cases", "population")])
###

## Set types
df_confirmed <- df[df$type == "confirmed",]
df_death <- df[df$type == "death",]
df_recovery <- df[df$data_type == "recovery",]
head(df_confirmed)
head(df_death)
head(df_recovery)

## Histogram
plot(df_confirmed$date, df_confirmed$cases, type = "h")
plot(df_death$date, df_death$cases, type = "h")
# plot(df_recovered$date, df_recovered$cases, type = "h") # no data for recovery

## Scatter plot and correlation
plot(df_confirmed$cases, df_death$cases, type = "p")
cor(df_confirmed$cases, df_death$cases)


## In addition set a period
start_date <- as.Date("2021-07-01")
end_date <- Sys.Date() 
df_date <- df[df$date >=start_date & df$date <= end_date,]
##

## Set types
df_date_confirmed <- df_date[df_date$type == "confirmed",]
df_date_death <- df_date[df_date$type == "death",]
df_date_recovery <- df_date[df_date$data_type == "recovery",]
head(df_date_confirmed)
head(df_date_death)
head(df_date_recovery)

## Histogram
plot(df_date_confirmed$date, df_date_confirmed$cases, type = "h")
plot(df_date_death$date, df_date_death$cases, type = "h")
# plot(df_date_recovered$date, df_date_recovered$cases, type = "h") # no data for recovery

plot(df_date_confirmed$cases, df_date_death$cases, type = "p")
cor(df_date_confirmed$cases, df_date_death$cases)

### Q0. Change the values of the location and the period and see the outcomes.
### Q1. What is the correlation between df_confirmed$cases and df_death$cases?
### Q2. Do we have a larger correlation value if we shift the dates to implement the time-lag?
### Q3. Do you have any other questions to explore?

#### Extra
plot(df_confirmed$date, df_confirmed$cases, type = "h", 
     main = paste("Comfirmed Cases in",COUNTRY), 
     xlab = "Date", ylab = "Number of Cases")

:::

1.9 gapminder Package

1.9.1 Hans Rosling (1948 – 2017)

Hans Rosling was a Swedish physician, academic, and public speaker. He was a professor of international health at Karolinska Institute[4] and was the co-founder and chairman of the Gapminder Foundation, which developed the Trendalyzer software system. (wikipedia)


1.9.2 Factfulness is … From the book

recognizing when a decision feels urgent and remembering that it rarely is.

To control the urgency instinct, take small steps.

  • Take a breath. When your urgency instinct is triggered, your other instincts kick in and your analysis shuts down. Ask for more time and more information. It’s rarely now or never and it’s rarely either/or.
  • Insist on the data. If something is urgent and important, it should be measured. Beware of data that is relevant but inaccurate, or accurate but irrelevant. Only relevant and accurate data is useful.
  • Beware of fortune-tellers. Any prediction about the future is uncertain. Be wary of predictions that fail to acknowledge that. Insist on a full range of scenarios, never just the best or worst case. Ask how often such predictions have been right before.
  • Be wary of drastic action. Ask what the side effects will be. Ask how the idea has been tested. Step-by-step practical improvements, and evaluation of their impact, are less dramatic but usually more effective.

# install.packages("gapminder")
library(gapminder)
df <- gapminder
df

glimpse(df)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "A…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia,…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987,…
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438,…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460,…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.981…

summary(df)
        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
                                       

1.9.3 Questions

  • List questions based on this data.
  • What do you want to see?
  • What kind of chart do you want to construct?

Review

  • R on R Studio/Posit Cloud (RStudio Cloud)
  • Three ways to run codes
    1. Console
    2. R Script
    3. Code Chunk in R Notebook
  • Packages
    1. tidyverse
    2. rmarkdown
    3. gapminder

EDA (A diagram from R4DS by H.W. and G.G.)

EDA from r4ds

Today: R Markdown and dplyr

2 Exploratory Data Analysis II

2.1 R Markdown

What is R Markdown: https://vimeo.com/178485416

  • Code Chunks
  • Text
  • YAML Metadata

2.1.1 What is R Markdown and R Notebook

R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both

  • save and execute code
  • generate high quality reports that can be shared with an audience

R Notebooks are an implementation of Literate Programming that allows for direct interaction with R while producing a reproducible document with publication-quality output.

An R Notebook is an R Markdown document with chunks that can be executed independently and interactively, with output visible immediately beneath the input.

(Reference: R Markdown: The Definitive Guide, 3.2 Notebook)


2.1.1.1 Two Goodies

  • Important: Implementation of Reproducible Research and Literate Programming

  • Useful to Render into Various Formats: R Notebook (HTML), R Markdown (HTML), PDF, MS Word, MS Powerpoint, Ioslides Presentation (HTML), Slidy Presentation (HTML), Beamer Presentation (PDF), etc.


2.1.2 Reproducible Research and Literate Programming

2.1.2.1 Literate Programming by D. Knuth

Literate programming is an approach to programming introduced by Donald Knuth in which a program is given as an explanation of the program logic in a natural language, such as English, interspersed with snippets of macros and traditional source code, from which a compilable source code can be generated

2.1.2.2 D. Knuth

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.


2.1.2.3 Reproducible Research - Quote from a Coursera Course

Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them. The need for reproducibility is increasing dramatically as data analyses become more complex, involving larger datasets and more sophisticated computations. Reproducibility allows for people to focus on the actual content of a data analysis, rather than on superficial details reported in a written summary. In addition, reproducibility makes an analysis more useful to others because the data and code that actually conducted the analysis are available.


2.1.2.4 R Markdown workflow, R for Data Science

R Markdown is also important because it so tightly integrates prose and code. This makes it a great analysis notebook because it lets you develop code and record your thoughts. It:

  • Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!

  • Supports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.

  • Helps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share why you did it with your colleagues or lab mates.


2.1.2.5 Records of EDA and Communication

  1. Memo on a scratch paper: R Scripts
  2. Record on a notebook: R Notebook (an R Markdown format)
  3. Short paper or a digital communication: R Notebook
  4. Paper or a report: R Markdown (html, pdf, MS Word, etc.)
  5. Presentation (html, pdf, MS Powerpoint, etc.)
  6. Publication of a Book

2.1.3 Let’s Get Started

  1. Start R Studio - Update R Studio if old
  2. Create a Project
  3. Tool > Install Packages rmarkdown
    • Or on Console: install.packages("rmarkdown")
  4. Tool > Install Packages tinytex (for pdf generation)
    • Alternatively, install.packages('tinytex')
    • If TeX is not installed: tinytex::install_tinytex() # install TinyTeX
      • If you are not sure, please check on Terminal in the left below pane:
        • which latex, which mktexlsr
  5. Let’s try!
    1. File > New File > R Notebook
    2. Save with a file name, say, test-notebook
    3. Preview by [Preview] button
    4. Run Code Chunk plot(cars) and then Preview again.
    5. Knit PDF, Word (and HTML)

2.1.4 Templates

2.1.4.1 RNotebook_Template

Template to submit your assignment of this course: RNotebook_Template.nb.html

title: "Title of R Notebook"
author: "ID and Your Name"
date: "2023-01-26" 
output:
  html_notebook: null
2.1.4.1.1 YAML
  • Change the title
  • Write ID and your name
  • Date is auto-generated and inserted. If you wish, you can replace “2023-01-26” by your favorite date style.

2.1.4.1.2 Code Chunk
  • When you execute or run a code within the notebook, the results appear beneath the code.
  • Try executing this chunk by clicking the Run button, a triangle pointing right, within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter (Win) or Cmd+Shift+Enter (Mac).
  • Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Option+I (Win) or Cmd+Option+I (Mac).
  • When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K (Win) or Cmd+Shift+K (Mac) to preview the HTML file).
  • The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

2.1.4.2 Testing R Markdown Formats

Various Output Formats: test-rmarkdown.nb.html

title: "Testing R Markdown Formats"
author: "DS-SL"
date: "2023-01-26"
output:
  html_notebook:
    number_sections: yes
  pdf_document: 
    number_sections: yes
  html_document:
    df_print: paged
    number_sections: yes
  word_document: 
    number_sections: yes
  powerpoint_presentation: default
  ioslides_presentation:
    widescreen: yes
    smaller: yes
  slidy_presentation: default
  beamer_presentation: default

2.1.4.3 Comments on Presentation Formats and Options

  • For slides, a new slide starts at ##, the second-level heading.
  • --- is page break for presentation formats.
  • For Word and Powerpoint, you can add your template. See the documents in References
    • Use R Markdown to create a Word document [similar for PowerPoint]
    • Save the rendered Word file as: ref-doc-style.docx
    • Edit the styles of the file ref-doc-style.docx
    • Add ref-doc-style.docx as reference_doc in YAML with indention as below
  word_document: 
    number_sections: yes
    reference_doc: ref-doc-style.docx
  powerpoint_presentation: 
    reference_doc: ref-ppt-style.pptx
  • You can use Output Options at the bottom of the gear icon next to Preview/knit button.

2.1.5 Markdown Language – or use WYSIWYG editor

  • Headers: #, ##, ###, ####
  • Lists: 1. 2. , *
  • Links: linked phrase
  • Images: ![alt text](figures/filename.jpg)
  • Block quotes” > (block)
  •   equations: e.g. $\frac{a}{b}$ for \(\frac{a}{b}\)
  • Horizontal rules: Three or more asterisks or dashes (*** or - - - )
  • Tables
  • Footnotes
  • Bibliographies and Citations
  • Slide breaks
  • Italicized text by _italic_, Bold text by **bold**
  • Superscripts, Subscripts, Strikethrough text

2.1.5.1 Visual R Markdown

R Studio introduced Visual Editor towards the end of 2021. It seems to be stable but it is not perfect to go back and forth from the original editor using tags. I always use the original editor and I am confident on all the functions of it but I do not have much experience on Visual Editor. [My Note in QALL401 2021]


2.1.6 References

2.2 Data Transforamtion with dplyr

2.2.1 dplyr Overview

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • mutate() adds new variables that are functions of existing variables
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.
  • group_by() takes an existing tbl and converts it into a grouped tbl.

You can learn more about them in vignette(“dplyr”). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette(“two-table”).

If you are new to dplyr, the best place to start is the data transformation chapter in R for data science.


2.2.2 select: Subset columns using their names and types

Helper Function Use Example
- Columns except select(babynames, -prop)
: Columns between (inclusive) select(babynames, year:n)
contains() Columns that contains a string select(babynames, contains(“n”))
ends_with() Columns that ends with a string select(babynames, ends_with(“n”))
matches() Columns that matches a regex select(babynames, matches(“n”))
num_range() Columns with a numerical suffix in the range Not applicable with babynames
one_of() Columns whose name appear in the given set select(babynames, one_of(c(“sex”, “gender”)))
starts_with() Columns that starts with a string select(babynames, starts_with(“n”))

2.2.3 filter: Subset rows using column values

Logical operator tests Example
> Is x greater than y? x > y
>= Is x greater than or equal to y? x >= y
< Is x less than y? x < y
<= Is x less than or equal to y? x <= y
== Is x equal to y? x == y
!= Is x not equal to y? x != y
is.na() Is x an NA? is.na(x)
!is.na() Is x not an NA? !is.na(x)

2.2.4 arrange and Pipe %>%

  • arrange() orders the rows of a data frame by the values of selected columns.

Unlike other dplyr verbs, arrange() largely ignores grouping; you need to explicitly mention grouping variables (`or use .by_group = TRUE) in order to group by them, and functions of variables are evaluated once per data frame, not once per group.

  • pipes in R for Data Science.

2.2.5 mutate

  • Create, modify, and delete columns

  • Useful mutate functions

    • +, -, log(), etc., for their usual mathematical meanings

    • lead(), lag()

    • dense_rank(), min_rank(), percent_rank(), row_number(), cume_dist(), ntile()

    • cumsum(), cummean(), cummin(), cummax(), cumany(), cumall()

    • na_if(), coalesce()### group_by() and summarise()


2.2.6 group_by


2.2.7 summarise or summarize

2.2.7.1 Summary functions

So far our summarise() examples have relied on sum(), max(), and mean(). But you can use any function in summarise() so long as it meets one criteria: the function must take a vector of values as input and return a single value as output. Functions that do this are known as summary functions and they are common in the field of descriptive statistics. Some of the most useful summary functions include:

  1. Measures of location - mean(x), median(x), quantile(x, 0.25), min(x), and max(x)
  2. Measures of spread - sd(x), var(x), IQR(x), and mad(x)
  3. Measures of position - first(x), nth(x, 2), and last(x)
  4. Counts - n_distinct(x) and n(), which takes no arguments, and returns the size of the current group or data frame.
  5. Counts and proportions of logical values - sum(!is.na(x)), which counts the number of TRUEs returned by a logical test; mean(y == 0), which returns the proportion of TRUEs returned by a logical test.
  • if_else(), recode(), case_when()

2.2.8 Learn dplyr by Examples

2.2.8.1 Data iris

iris

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

2.2.8.2 select 1 - columns 1, 2, 5

select(iris, c(1,2,5))

2.2.8.3 select 2 - except Species

select(iris, -Species)

2.2.8.4 select 3 - change column names

select(iris, sl = Sepal.Length, sw = Sepal.Width, sp = Species)

2.2.8.5 filter - by names

filter(iris, Species == "virginica")

2.2.8.6 arrange - ascending and descending order

arrange(iris, Sepal.Length, desc(Sepal.Width))

2.2.8.7 mutate - rank

iris %>% mutate(sl_rank = min_rank(Sepal.Length)) %>% arrange(sl_rank)

2.2.8.8 group_by and summarize

iris %>% 
  group_by(Species) %>% 
  summarize(sl = mean(Sepal.Length), sw = mean(Sepal.Width), 
  pl = mean(Petal.Length), pw = mean(Petal.Width))
  • mean: mean() or mean(x, na.rm = TRUE) - arithmetic mean (average)
  • median: median() or median(x, na.rm = TRUE) - mid value

For more examples see

dplr_iris

2.2.9 References of dplyr

2.2.9.1 RStudio Primers: See References in Moodle at the bottom

  1. The Basics – r4ds: Explore, I
  1. Work with Datar4ds: Wrangle, I
  • Working with Tibbles
  • Isolating Data with dplyr
  • Deriving Information with dplyr
  1. Visualize Data – r4ds: Explore, II
  2. Tidy Your Data – r4ds: Wrangle, II
  3. Iterate – r4ds: Program
  4. Write Functions – r4ds: Program

2.2.10 Learn dplyr by Examples II - gapminder

2.2.10.1 ggplot2 Overview

ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

Examples

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) + 
  geom_boxplot(mapping = aes(x = class, y = hwy))

Template

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

2.2.10.2 Gapminder and R Package gapminder

Gapminder was founded by Ola Rosling, Anna Rosling Rönnlund, and Hans Rosling


library(tidyverse)
library(gapminder)
library(WDI)

2.2.10.3 R Package gapminder data

df <- gapminder
df

glimpse(df)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "A…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia,…
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987,…
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438,…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460,…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.981…

summary(df)
        country        continent        year         lifeExp     
 Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
 Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
 Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
 Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
 Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
 Australia  :  12                  Max.   :2007   Max.   :82.60  
 (Other)    :1632                                                
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  
                                       

2.2.10.4 Tidyverse::ggplot

2.2.10.4.1 First Try - with failures
ggplot(df, aes(x = year, y = lifeExp)) + geom_point()


ggplot(df, aes(x = year, y = lifeExp)) + geom_line()


ggplot(df, aes(x = year, y = lifeExp)) + geom_boxplot()


typeof(pull(df, year)) # same as typeof(df$year)
[1] "integer"

ggplot(df, aes(y = lifeExp, group = year)) + geom_boxplot()


2.2.10.4.2 Box Plot
ggplot(df, aes(x = as_factor(year), y = lifeExp)) + geom_boxplot()


2.2.10.5 Applications of dplyr

2.2.10.5.1 filter
df %>% filter(country == "Afghanistan") %>%
  ggplot(aes(x = year, y = lifeExp)) + geom_line()


df %>% filter(country %in% c("Afghanistan", "Japan")) %>%
  ggplot(aes(x = year, y = lifeExp, color = country)) + geom_line()


df %>% distinct(country) %>% pull()
  [1] Afghanistan              Albania                 
  [3] Algeria                  Angola                  
  [5] Argentina                Australia               
  [7] Austria                  Bahrain                 
  [9] Bangladesh               Belgium                 
 [11] Benin                    Bolivia                 
 [13] Bosnia and Herzegovina   Botswana                
 [15] Brazil                   Bulgaria                
 [17] Burkina Faso             Burundi                 
 [19] Cambodia                 Cameroon                
 [21] Canada                   Central African Republic
 [23] Chad                     Chile                   
 [25] China                    Colombia                
 [27] Comoros                  Congo, Dem. Rep.        
 [29] Congo, Rep.              Costa Rica              
 [31] Cote d'Ivoire            Croatia                 
 [33] Cuba                     Czech Republic          
 [35] Denmark                  Djibouti                
 [37] Dominican Republic       Ecuador                 
 [39] Egypt                    El Salvador             
 [41] Equatorial Guinea        Eritrea                 
 [43] Ethiopia                 Finland                 
 [45] France                   Gabon                   
 [47] Gambia                   Germany                 
 [49] Ghana                    Greece                  
 [51] Guatemala                Guinea                  
 [53] Guinea-Bissau            Haiti                   
 [55] Honduras                 Hong Kong, China        
 [57] Hungary                  Iceland                 
 [59] India                    Indonesia               
 [61] Iran                     Iraq                    
 [63] Ireland                  Israel                  
 [65] Italy                    Jamaica                 
 [67] Japan                    Jordan                  
 [69] Kenya                    Korea, Dem. Rep.        
 [71] Korea, Rep.              Kuwait                  
 [73] Lebanon                  Lesotho                 
 [75] Liberia                  Libya                   
 [77] Madagascar               Malawi                  
 [79] Malaysia                 Mali                    
 [81] Mauritania               Mauritius               
 [83] Mexico                   Mongolia                
 [85] Montenegro               Morocco                 
 [87] Mozambique               Myanmar                 
 [89] Namibia                  Nepal                   
 [91] Netherlands              New Zealand             
 [93] Nicaragua                Niger                   
 [95] Nigeria                  Norway                  
 [97] Oman                     Pakistan                
 [99] Panama                   Paraguay                
[101] Peru                     Philippines             
[103] Poland                   Portugal                
[105] Puerto Rico              Reunion                 
[107] Romania                  Rwanda                  
[109] Sao Tome and Principe    Saudi Arabia            
[111] Senegal                  Serbia                  
[113] Sierra Leone             Singapore               
[115] Slovak Republic          Slovenia                
[117] Somalia                  South Africa            
[119] Spain                    Sri Lanka               
[121] Sudan                    Swaziland               
[123] Sweden                   Switzerland             
[125] Syria                    Taiwan                  
[127] Tanzania                 Thailand                
[129] Togo                     Trinidad and Tobago     
[131] Tunisia                  Turkey                  
[133] Uganda                   United Kingdom          
[135] United States            Uruguay                 
[137] Venezuela                Vietnam                 
[139] West Bank and Gaza       Yemen, Rep.             
[141] Zambia                   Zimbabwe                
142 Levels: Afghanistan Albania Algeria Angola ... Zimbabwe

df %>% filter(country %in% c("Brazil", "Russia", "India", "China")) %>%
  ggplot(aes(x = year, y = lifeExp, color = country)) + geom_line()

Russian data is missing.


2.2.11 Exercises

  1. Change lifeExp to pop and gdpPercap and do the same.
  2. Choose ASEAN countries and do the similar investigations.
  • Brunei, Cambodia, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore.
  1. Choose several countries by yourself and do the similar investigations.

2.2.12 group_by and summarize

Let us use the variable continent and summarize the data.

df_lifeExp <- df %>% group_by(continent, year) %>% 
  summarize(mean_lifeExp = mean(lifeExp), median_lifeExp = median(lifeExp), max_lifeExp = max(lifeExp), min_lifeExp = min(lifeExp), .groups = "keep")

df_lifeExp

df %>% filter(year %in% c(1952, 1987, 2007)) %>%
  ggplot(aes(x=as_factor(year), y = lifeExp, fill = continent)) +
  geom_boxplot()


df_lifeExp %>% ggplot(aes(x = year, y = mean_lifeExp, color = continent)) +
  geom_line()


df_lifeExp %>% ggplot(aes(x = year, y = mean_lifeExp, color = continent, linetype = continent)) +
  geom_line()


df_lifeExp %>% ggplot() +
  geom_line(aes(x = year, y = mean_lifeExp, color = continent)) + 
  geom_line(aes(x = year, y = median_lifeExp, linetype = continent))

2.3 The Week Three Assignment (in Moodle)

R Markdown and dplyr

  • Create an R Notebook of a Data Analysis containing the following and submit the rendered HTML file (eg. a2_123456.nb.html)
    1. create an R Notebook using the R Notebook Template in Moodle, save as a2_123456.Rmd,
    2. write your name and ID and the contents,
    3. run each code block,
    4. preview to create a2_123456.nb.html,
    5. submit a2_123456.nb.html to Moodle.
  1. Pick data from the built-in datasets besides cars. (library(help = "datasets") or go to the site The R Datasets Package)

    • Information of the data: Name, Description, Usage, Format, Source, References (Hint: ?cars)
    • Use head(), str(), …, and create at least one chart using ggplot2 - Code Chunk.
      • Don’t forget to add library(tidyverse) in the first code chunk.
    • An observation of the chart - in your own words.

  1. Load gapminder by library(gapminder).

    • Choose pop or gdpPercap, or both, one country in the data, a group of countries in the data.
    • Create charts using ggplot2 with geom_line and the variables and countries chosen in 1. (See examples of the charts for lifeExp.)
    • Study the data as you like.
    • Observations and difficulties encountered.

Due: 2023-01-09 23:59:00. Submit your R Notebook file in Moodle (The Second Assignment). Due on Monday!


2.3.1 Original Data? WDI?

gapminder

2.3.1.1 WDI

  • SP.DYN.LE00.IN: Life expectancy at birth, total (years)
  • NY.GDP.PCAP.KD: GDP per capita (constant 2015 US$)
  • SP.POP.TOTL: Population, total
df_wdi <- WDI(
  country = "all", 
  indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD")
)
source("~/Documents/_class/gsclasses/2022/da4r2022_note/docs/eda5.Rmd")
Error in source("~/Documents/_class/gsclasses/2022/da4r2022_note/docs/eda5.Rmd") : 
  ~/Documents/_class/gsclasses/2022/da4r2022_note/docs/eda5.Rmd:24:4: unexpected numeric constant
23: 
24: 1. 2022.12
       ^

df_wdi

df_wdi_extra <- WDI(
  country = "all", 
  indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD"), 
  extra = TRUE
)

df_wdi_extra

3 Exploratory Data Analysis III

3.1 Importing Public Data, WDI

3.1.1 Reviews and Previews

library(tidyverse)
library(gapminder)
library(maps)
library(WDI)
library(readxl)
library(ggrepel)
  • We have used tidyverse and gapminder already.
  • If you have not installed WDI, install it.
  • We will not use ggrepel but if you want to use it, install it.
  • maps and readxl are bundled in tidyverse but need to be attached by library.

3.1.1.1 Gapminder Package Data

df <- gapminder
df

3.1.1.2 gdpPercap of ASEAN countries

asean <- c("Brunei", "Cambodia", "Laos", "Myanmar", 
           "Philippines", "Indonesia", "Malaysia", "Singapore")
df %>% filter(country %in% asean) %>%
  ggplot(aes(x = year, y = gdpPercap, col = country)) + geom_line()


df %>% filter(country %in% asean) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, col = country)) + geom_point()


df %>% filter(country %in% asean) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, col = country)) + 
  geom_point() + coord_trans(x = "log10", y = "identity")

\(\log_{10}{100}\) = 2, \(\log_{10}{1000}\) = 3, \(\log_{10}{10000}\) = 4


library(ggrepel)
df2007 <- df %>% filter(country %in% asean, year == 2007)
df %>% filter(country %in% asean) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, col = country))+ 
  geom_line() + geom_label_repel(data = df2007, aes(label = country)) + geom_point()  +
  coord_trans(x = "log10", y = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1), legend.position = "none") +
  labs(title = "Life Expectancy vs GDP Per Capita of ASEAN Countries",
       subtitle = "Data: gapminder package", x = "GDP per Capita", y = "Life Expectancy")



3.1.1.3 World Bank: World Development Indicators (WDI)

  • SP.DYN.LE00.IN: Life expectancy at birth, total (years)
  • NY.GDP.PCAP.KD: GDP per capita (constant 2015 US$)
  • SP.POP.TOTL: Population, total
df_wdi <- WDI(
  country = "all", 
  indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD")
)

df_wdi

df_wdi_extra <- WDI(
  country = "all", 
  indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD"), 
  extra = TRUE
)

df_wdi_extra

3.1.2 Exploratory Data Analysis

3.1.2.1 What is EDA (Posit Primers: Visualise Data)

  1. EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

  2. Generate questions about your data

  3. Search for answers by visualising, transforming, and/or modeling your data

Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.


3.1.3 Open and Public Data, World Bank

3.1.3.1 Open Government Data Toolkit: Open Data Defined

The term Open Data has a very precise meaning. Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.

  1. The data must be legally open, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
  2. The data must be technically open, which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions. To make Open Data easier to find, most organizations create and manage Open Data catalogs.

3.1.4 World Bank: WDI - World Development Indicaters

  • World Bank: https://www.worldbank.org
  • Who we are:
    • To end extreme poverty: By reducing the share of the global population that lives in extreme poverty to 3 percent by 2030.
    • To promote shared prosperity: By increasing the incomes of the poorest 40 percent of people in every country.
  • World Bank Open Data: https://data.worldbank.org
    • Data Bank, World Development Indicators, etc.
  • World Development Indicators (WDI) : the World Bank’s premier compilation of cross-country comparable data on development; 1400 time series indicators
    • Themes: Poverty and Inequality, People, Environment, Economy, States and Markets, Global Links
    • Open Data & DataBank: Explore data, Query database
    • Bulk Download: Excel, CSV
    • API Documentation

3.1.5 R Package WDI

  • WDI: World Development Indicators and Other World Bank Data
  • Search and download data from over 40 databases hosted by the World Bank, including the World Development Indicators (‘WDI’), International Debt Statistics, Doing Business, Human Capital Index, and Sub-national Poverty indicators.
  • Version: 2.7.4
  • Materials: README - usage
    • NEWS - version history
  • Published: 2021-04-06
  • README: https://cran.r-project.org/web/packages/WDI/readme/README.html
  • Reference manual: WDI.pdf

3.1.6 Function WDI

  • Usage
WDI(country = "all",
    indicator = "NY.GDP.PCAP.KD",
    start = 1960,
    end = 2020,
    extra = FALSE,
    cache = NULL)
  • Arguments See Help!
    • country: Vector of countries (ISO-2 character codes, e.g. “BR”, “US”, “CA”, or “all”)
    • indicator: If you supply a named vector, the indicators will be automatically renamed: c('women_private_sector' = 'BI.PWK.PRVS.FE.ZS')

3.1.7 Function WDIsearch

library(WDI)
WDIsearch(string = "NY.GDP.PCAP.KD", 
          field = "indicator", cache = NULL)

WDIsearch(string = "population", 
          field = "name", short=FALSE, cache = NULL)

WDIsearch(string = "NY.GDP.PCAP.KD", 
  field = "indicator", short = FALSE, cache = NULL)
WDIsearch(string = "gdp", 
  field = "name", short = TRUE, cache = NULL) 

3.1.8 Bulk Downloads at WDI site

WDIbulk downloads the zip file of Bulk Downloads in WDI site , it is a list containing 6 data frames: Data, Country, Series, Country-Series, Series-Time, FootNote.


3.1.9 WDIcache

Download an updated list of available WDI indicators from the World Bank website. Returns a list for use in the WDIsearch function.

wdi_cache <- WDIcache()

Downloading all series information from the World Bank website can take time. The WDI package ships with a local data object with information on all the series available on 2012-06-18. You can update this database by retrieving a new list using WDIcache, and then feeding the resulting object to WDIsearch via the cache argument.


wdi_cache

3.1.10 WDI_data

List of 2 data frames

The first character matrix includes a full list of WDI series. This list is updated semi-regularly. Users can refresh the list manually using the ‘WDIcache()’ function and search in the updated list using the ‘cache’ argument.

WDI_data$country  %>% filter(country == "Japan")

WDIsearch(string = "gdp", 
  field = "name", short = FALSE, cache = NULL) #cache = wdi_cache

3.1.11 World Development Indicators - Summary

Find indicators:

  1. WDIsearch(string = "gdp", field = "name", short = FALSE, cache = NULL)
  • WDIsearch(string = "gdp", field = "name", short = FALSE, cache = wdi_cache)
  • WDIsearch(string = "NY.GDP.PCAP.KD", field = "indicator", short = FALSE, cache = NULL)
  1. WDI: Data Themes
  2. Browse by Indicators: https://data.worldbank.org/indicator
    • Featured Indicators or All Indicators
    • Obtain the indicator from the detail or the URL

3.1.11.1 Example: CO2 emissions (metric tons per capita)

WDIsearch(string = "EN.ATM.CO2E.PC", field = "indicator", 
          short = FALSE, cache = NULL) #cache = wdi_cache
WDIsearch(string = "EN.ATM.CO2E.PC", field = "indicator", 
          short = FALSE, cache = NULL) %>% pull(description) #cache = wdi_cache
[1] "Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring."
  • Source: Climate Watch. 2020. GHG Emissions. Washington, DC: World Resources Institute. Available at: climatewatchdata.org/ghg-emissions. See SP.POP.TOTL for the denominator’s source.

co2pcap <- WDI(country = "all", indicator = "EN.ATM.CO2E.PC", start = 1960, end = NULL, extra = TRUE, cache = NULL) #cache = wdi_cache
co2pcap

write_csv(co2pcap, "data/co2pcap.csv")

co2pcap %>% filter(country %in% c("World", "Japan", "United States", "China")) %>%
  ggplot(aes(x = year, y = EN.ATM.CO2E.PC, color = country)) + 
  geom_line()


co2pcap %>% filter(!is.na(EN.ATM.CO2E.PC)) %>% pull(year) %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1990    1997    2005    2005    2012    2019 

co2pcap %>% 
  filter(country %in% c("World", "Japan", "United States", "China"), year %in% 1990:2019) %>%
  ggplot(aes(x = year, y = EN.ATM.CO2E.PC, color = country)) + 
  geom_line()


co2pcap %>% 
  filter(income != "Aggregates", year == 2019) %>%
  ggplot(aes(x = income, y = EN.ATM.CO2E.PC, fill = income)) + 
  geom_boxplot()


co2pcap %>% 
  filter(income != "Aggregates", year == 2019, !is.na(EN.ATM.CO2E.PC)) %>%
  ggplot(aes(x = income, y = EN.ATM.CO2E.PC, fill = income)) + 
  geom_boxplot()


co2pcap %>% 
  filter(income != "Aggregates", year == 2019, !is.na(EN.ATM.CO2E.PC)) %>%
  group_by(income) %>%
  summarize(min = min(EN.ATM.CO2E.PC), med = median(EN.ATM.CO2E.PC), max = max(EN.ATM.CO2E.PC), IQR = IQR(EN.ATM.CO2E.PC), n = n())

co2pcap %>% 
  filter(income != "Aggregates", year == 2019, !is.na(EN.ATM.CO2E.PC)) %>%
  filter(!income %in% c("High income", "Low income", "Lower middle income", "Upper middle income"))
co2pcap %>% 
  filter(income != "Aggregates", year == 2019) %>%
  filter(income == "Not classified")

co2pcap %>% 
  filter(income != "Aggregates", year == 2019) %>%
  ggplot(aes(map_id = country)) + 
  geom_map(aes(fill = income), map = world_map) + expand_limits(x = world_map$long, y = world_map$lat) +
  labs(title = "Income Levels in 2019")


co2pcap %>% distinct(country)

world_map %>% distinct(region)

world_map0 <- world_map %>% 
  mutate(region = case_when(region == "Macedonia" ~ "North Macedonia",
                            region == "Ivory Coast"  ~ "Cote d'Ivoire",
                            region == "Democratic Republic of the Congo"  ~ "Congo, Dem. Rep.",
                            region == "Republic of Congo" ~  "Congo, Rep.",
                            region == "UK" ~  "United Kingdom",
                            region == "USA" ~  "United States",
                            region == "Laos" ~  "Lao PDR",
                            region == "Slovakia" ~  "Slovak Republic",
                            region == "Saint Lucia" ~  "St. Lucia",
                            region == "Kyrgyzstan"  ~  "Kyrgyz Republic",
                            region == "Micronesia" ~ "Micronesia, Fed. Sts.",
                            region == "Swaziland"  ~ "Eswatini", 
                            region == "Virgin Islands"  ~ "Virgin Islands (U.S.)", 
                            region == "Russia" ~ "Russian Federation", 
                            region == "Egypt" ~ "Egypt, Arab Rep.",
                            region == "South Korea" ~ "Korea, Rep.",
                            region == "North Korea" ~ "Korea, Dem. People's Rep.",
                            region == "Iran" ~ "Iran, Islamic Rep.",
                            region == "Brunei" ~ "Brunei Darussalam",
                            region == "Venezuela" ~ "Venezuela, RB",
                            region == "Yemen" ~ "Yemen, Rep.",
                            region == "Bahamas" ~ "Bahamas, The",
                            region == "Syria" ~ "Syrian Arab Republic",
                            region == "Turkey" ~ "Turkiye",
                            region == "Cape Verde" ~ "Cabo Verde",
                            region == "Gambia" ~ "Gambia, The",
                            region == "Czech Republic" ~ "Czechia",
                            TRUE ~ region))

write_csv(world_map0, "data/world_map0.csv")
map0_url <- "https://icu-hsuzuki.github.io/da4r2022_note/data/world_map0.csv"
world_map0 <- read_csv(map0_url)

co2pcap %>% filter(income != "Aggregates", year == 2019) %>% 
  anti_join(world_map0, by = c("country"="region"))

world_map0 %>% anti_join(co2pcap, by = c("region"="country")) %>% distinct(region) %>% arrange(region)

world_map0 %>% left_join(iso3166, by = c("region" = "ISOname")) %>%
  filter(is.na(a2)) %>% distinct(region)

3.2 Data Visualization and `ggplot2’

3.2.1 Learning Resouces

  • Posit Primers:
    • Visualize Data: Learn how to use ggplot2 to make any type of plot with your data. Then learn the best ways to visualize patterns within values and relationships between variables.
  • r4ds: Data Visualization

3.2.2 Exploratory Data Analysis

3.2.2.1 What is EDA?

EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

  1. Generate questions about your data

  2. Search for answers by visualising, transforming, and/or modeling your data

  3. Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.


3.2.2.2 Two useful questions

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

  1. What type of variation occurs within my variables?
  2. What type of covariation occurs between my variables?

The rest of this tutorial will look at these two questions. To make the discussion easier, let’s define some terms…


3.2.3 Data Visualization

3.2.4 ggplot2 Basics

visualization


3.2.5 Example: World Inequility Report - WIR2022

library(readxl)
url_summary <- "https://wir2022.wid.world/www-site/uploads/2022/03/WIR2022TablesFigures-Summary.xlsx"
download.file(url = url_summary, destfile = "data/WIR2022s.xlsx") 
excel_sheets("data/WIR2022s.xlsx")

3.2.6 F14: Global carbon inequality, 2019. Group contribution to world emissions (%)

Note that the sheet name of F14 has period at the end.

df_f14 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F14.")
Error: `path` does not exist: ‘data/WIR2022s.xlsx’
  • \n for line break in the title.

3.2.6.1 Categorical vs Continuous Value

df_f14 %>% 
  ggplot(aes(x = Group, y = Share)) +
  geom_col()

df_f14 %>% 
  ggplot(aes(x = Group, y = Share)) +
  geom_col(width = 0.5, fill = scales::hue_pal()(1)[1]) + 
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 14. Global carbon inequality, \n2019 Group contribution to world emissions (%)", 
       x = "", y = "Share of world emissions (%)")

3.2.6.2 Memo

  • width = 0.5: width of bars
  • fill = scales::hue_pal()(1)[1]): hue scale
  • scale_y_continuous(labels = scales::percent_format(accuracy = 1)): percent format
    • if accuracy = 0.1, we have 10.0% etc.
  • labs(title = "Figure 14. Global carbon inequality, \n2019 Group contribution to world emissions (%)", x = "", y = "Share of world emissions (%)")
    • title = ““: \n is for line feed
    • x, y: labels of x-axis and y-axis

3.2.7 F12: Female share in global labor incomes, 1990-2020

df_f12 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F12")
df_f12

df_f12 %>% 
  select(year = "Data needs to be updated", value = ...2) %>%
  filter(!is.na(year)) %>%
  ggplot(aes(x = year, y = value)) +
  geom_col(width = 0.5, fill = scales::hue_pal()(2)[2])

df_f12 %>% 
  select(year = "Data needs to be updated", value = ...2) %>%
  filter(!is.na(year)) %>%
  ggplot(aes(x = year, y = value)) +
  geom_col(width = 0.5, fill = scales::hue_pal()(2)[2]) +
  geom_hline(yintercept = 0.5, linetype = 2, colour = scales::hue_pal()(2)[1]) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 12. Female share in global labor incomes, 1990-2020", 
        x = "", y = "") +
  annotate("text", x = 1, y = 0.48, label = "Gender parity", size = 3) +
  annotate("text", x = 5.2, y = 0.47, label = stringr::str_wrap("Women make only 35% of global labor incomes, men make the remaining  65%.", width = 40), size = 3)

3.2.8 F1: Global income and wealth inequality, 2021

df_f1 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F1")
df_f1

df_f1_rev %>%
  ggplot(aes(x = cat, y = value, fill = group)) +
  geom_col(position = "dodge")

3.2.9 References of ggplot2

3.2.9.1 RStudio Primers: See References in Moodle at the bottom

Visualize Data

  • Exploratory Data Analysis
  • Bar Charts
  • Histograms
  • Boxplots and Counts
  • Scatterplots
  • Line Plots
  • Overplotting and Big Data
  • Customize Your Plots

3.3 The Week Four Assignment (in Moodle)

WDI and ggplot2

  • Create an R Notebook of a Data Analysis containing the following and submit the rendered HTML file (eg. a3_123456.nb.html by replacing 123456 with your ID)
    1. create an R Notebook using the R Notebook Template in Moodle, save as a3_123456.Rmd,
    2. write your name and ID and the contents,
    3. run each code block,
    4. preview to create a3_123456.nb.html,
    5. submit a3_123456.nb.html to Moodle.
  1. Choose at least one indicator of WDI

    • Information of the data: Name, Indicator, Description, Source, etc.
    • Download the data with WDI
    • Explain why you chose the indicator
    • List questions you want to study

  1. Explore the data using visualization using ggplot2

    • Use a histogram (geom_histogram), boxplot (geom_boxplot), a scatter plot (geom_point), a line plot (geom_line)
    • For at least one chart, add title, and labels of axis, and add an explanation of it
  2. Observations and difficulties encountered.

Due: 2023-01-16 23:59:00. Submit your R Notebook file in Moodle (The Third Assignment). Due on Monday!

4 Exploratory Data Analysis (EDA) IV

4.1 Tidy Data

4.1.1 Reviews and Previews

4.1.2 Example: World Inequility Report - WIR2022

library(tidyverse)
library(readxl)
url_summary <- "https://wir2022.wid.world/www-site/uploads/2022/03/WIR2022TablesFigures-Summary.xlsx"
download.file(url = url_summary, destfile = "data/WIR2022s.xlsx") 
excel_sheets("data/WIR2022s.xlsx")

4.1.3 F1: Global income and wealth inequality, 2021

df_f1 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F1")
df_f1

df_f1_rev %>%
  ggplot(aes(x = cat, y = value, fill = group)) +
  geom_col(position = "dodge")

4.1.4 References of tidyr

4.1.4.1 RStudio Primers: See References in Moodle at the bottom

Tidy Your Data

  • Reshape Data
  • Separate and Unite Columns
  • Join Data Sets

4.1.5 Variables, values, and observations: Definitions

  • A variable is a quantity, quality, or property that you can measure.
  • A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
  • An observation or case is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a case or data point.
  • Tabular data is a table of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own cell, each variable in its own column, and each observation in its own row.
  • So far, all of the data that you’ve seen has been tidy. In real-life, most data isn’t tidy, so we’ll come back to these ideas again in Data Wrangling.

4.1.6 Tidy Data

“Data comes in many formats, but R prefers just one: tidy data.” — Garrett Grolemund

Data can come in a variety of formats, but one format is easier to use in R than the others. This format is known as tidy data. A data set is tidy if:

  1. Each variable is in its own column
  2. Each observation is in its own row
  3. Each value is in its own cell (this follows from #1 and #2)

“Tidy data sets are all alike; but every messy data set is messy in its own way.” — Hadley Wickham

“all happy families are all alike; each unhappy family is unhappy in its own way” - Tolstoy’s Anna Karenina


4.1.7 tidyr Basics

  1. Each variable is in its own column
  2. Each observation is in its own row

4.1.8 Pivot data from wide to long: pivot_longer()

pivot_longer(data, cols = <columns to pivot into longer format>,
  names_to = <name of the new character column>, # e.g. "group", "category", "class"
  values_to = <name of the column the values of cells go to>) # e.g. "value", "n"
df_f1
(df_f1_rev <- df_f1 %>% pivot_longer(-1, names_to = "group", values_to = "value"))

df_f1_rev %>% 
  ggplot(aes(x = ...1, y = value, fill = group)) +
  geom_col(position = "dodge")

df_f1_rev %>% filter(group != "Top 1%") %>%
  ggplot() +
  geom_col(aes(x = ...1, y = value, fill = group), position = "dodge") +
  geom_text(aes(x = ...1, y = value, group = group, 
            label = scales::label_percent(accuracy=1)(value)), 
            position = position_dodge(width = 0.9)) + 
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 1. Global income and wealth inequality, 2021",
       x = "", y = "Share of total income or wealth", fill = "")

Interpretation: The global bottom 50% captures 8.5% of total income measured at Purchasing Power Parity (PPP). The global bottom 50% owns 2% of wealth (at Purchasing Power Parity). The global top 10% owns 76% of total Household wealth and captures 52% of total income in 2021. Note that top wealth holders are not necessarily top income holders. Incomes are measured after the operation of pension and unemployment systems and before taxes and transfers.
Sources and series: wir2022.wid.world/methodology.


4.1.9 F2: The poorest half lags behind: Bottom 50%, middle 40% and top 10% income shares across the world in 2021

df_f2 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F2")
df_f2

df_f2 %>% pivot_longer(cols = 3:5, names_to = "group", values_to = "value")

df_f2 %>% pivot_longer(cols = 3:5, names_to = "group", values_to = "value") %>%
  ggplot(aes(x = iso, y = value, fill = group)) +
  geom_col(position = "dodge")

4.1.10 Pivot data from long to wide:

pivot_wider() In Console: vignette(“pivot”)

pivot_wider(data, 
  names_from = <name of the column (or columns) to get the name of the output column>,
  values_from = <name of the column to get the value of the output>) 

pivot_wider(data, names_from = group, values_from = value) 

4.1.11 Practice: F4 and F13

F4 and F13 are similar. Please use pivot_longer to tidy the data and create charts.

4.1.11.1 Done Last Week

  • F12: Female share in global labor incomes, 1990-2020
  • F14: Global carbon inequality, 2019. Group contribution to world emissions (%)

4.1.12 F3: Top 10/Bottom 50 income gaps across the world, 2021

df_f3 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F3")
df_f3

4.1.13 F3: Top 10/Bottom 50 income gaps across the world, 2021 - Original


  • To 10 / Bottom 50 ratio has 5 classes: 5-12, 12-13, 13-16, 16-19, 19-140
df_f3$T10B50 %>% summary()

df_f3 %>% ggplot() + geom_histogram(aes(T10B50))

df_f3 %>% arrange(desc(T10B50))

df_f3 %>% 
  mutate(`Top 10 Bottom 50 Ratio` = cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), 
                                        include.lowest = FALSE)) 

world_map <- map_data("world")
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` = cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), 
                                        include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + 
  geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), map = world_map) + 
  expand_limits(x = world_map$long, y = world_map$lat)

world_map_wir <- world_map
world_map_wir$region[
  world_map_wir$region=="Democratic Republic of the Congo"]<-"DR Congo"
world_map_wir$region[world_map_wir$region=="Republic of Congo"]<-"Congo"
world_map_wir$region[world_map_wir$region=="Ivory Coast"]<-"Cote dIvoire"
world_map_wir$region[world_map_wir$region=="Vietnam"]<-"Viet Nam"
world_map_wir$region[world_map_wir$region=="Russia"]<-"Russian Federation"
world_map_wir$region[world_map_wir$region=="South Korea"]<-"Korea"
world_map_wir$region[world_map_wir$region=="UK"]<-"United Kingdom"
world_map_wir$region[world_map_wir$region=="Brunei"]<-"Brunei Darussalam"
world_map_wir$region[world_map_wir$region=="Laos"]<-"Lao PDR"
world_map_wir$region[world_map_wir$region=="Cote dIvoire"]<-"Cote d'Ivoire"
world_map_wir$region[world_map_wir$region=="Cape Verde"]<- "Cabo Verde"
world_map_wir$region[world_map_wir$region=="Syria"]<- "Syrian Arab Republic"
world_map_wir$region[world_map_wir$region=="Trinidad"]<- "Trinidad and Tobago"
world_map_wir$region[world_map_wir$region=="Tobago"]<- "Trinidad and Tobago"

df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` = 
    cut(T10B50, breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + 
  geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), 
    map = world_map_wir) + 
    expand_limits(x = world_map_wir$long, y = world_map_wir$lat)

df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` = 
    cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), 
    map = world_map_wir) + expand_limits(x = world_map_wir$long, y = world_map_wir$lat) + 
  coord_map("orthographic", orientation = c(25, 60, 0))

df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` = 
  cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), 
    map = world_map_wir) + expand_limits(x = world_map_wir$long, y = world_map_wir$lat) + 
  coord_map("orthographic", orientation = c(15, -80, 0))

df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` = 
  cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), 
    map = world_map_wir) + 
  expand_limits(x = world_map_wir$long, y = world_map_wir$lat)

df_f3 %>% 
  mutate(`Top 10 Bottom 50 Ratio` = 
        cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + 
  geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), map = world_map_wir) + 
  expand_limits(x = world_map_wir$long, y = world_map_wir$lat)  + 
  labs(title = "Figure 3. Top 10/Bottom 50 income gaps across the world, 2021",
       x = "", y = "", fill = "Top 10/Bottom 50 ratio") +
  theme(legend.position="bottom", 
        axis.text.x=element_blank(), axis.ticks.x=element_blank(),
        axis.text.y=element_blank(), axis.ticks.y=element_blank()) + 
  scale_fill_brewer(palette='YlOrRd')


df_f3 %>% anti_join(world_map_wir, by = c("Country" = "region"))

Filtering joins

  • anti_join(x,y, ...): return all rows from x without a match in y.
  • semi_join(x,y, ...): return all rows from x with a match in y.

Check dplyr cheat sheet, and Posit Primers Tidy Data.


4.1.14 Remaining Charts

  • F5: Global income inequality: T10/B50 ratio, 1820-2020 - fit curve

  • F9: Average annual wealth growth rate, 1995-2021 - fit curve + alpha

  • F7: Global income inequality, 1820-2020 - pivot + fit curve

  • F10: The share of wealth owned by the global 0.1% and billionaires, 2021 - pivot + fit curve

  • F6: Global income inequality: Between vs. Within country inequality (Theil index), 1820-2020 - pivot + area

  • F11: Top 1% vs bottom 50% wealth shares in Western Europe and the US, 1910-2020 - pivot name_sep + fit curve

  • F8: The rise of private versus the decline of public wealth in rich countries, 1970-2020 - rename + pivot + pivot + fit curve

  • F15: Per capita emissions acriss the world, 2019 - add row names + dodge


4.1.15 F5: Global income inequality: T10/B50 ratio, 1820-2020

(df_f5 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F5"))

df_f5 %>% ggplot(aes(x = y, y = t10b50)) + geom_line() + geom_smooth(span=0.25, se=FALSE)

4.1.16 F9: Average annual wealth growth rate, 1995-2021 - fit curve + alpha

df_f9 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F9"); df_f9

df_f9 %>% 
  ggplot(aes(x = p, y = `Wealth growth 1995-2021`)) + geom_smooth(span = 0.30, se = FALSE)

4.1.17 F7: Global income inequality, 1820-2020 - pivot + fit curve

df_f7 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F7"); df_f7

df_f7 %>% 
  pivot_longer(cols = 2:4, names_to = "type", values_to = "value") %>%
  ggplot(aes(x = y, y = value, color = type)) +
  stat_smooth(formula = y~x, method = "loess", span = 0.25, se = FALSE)

4.1.18 F10: The share of wealth owned by the global 0.1% and billionaires, 2021 - pivot + fit curve

df_f10 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F10"); df_f10

df_f10 %>% 
  select(year, "Global Billionaire Wealth" = bn_hhweal, "Top 0.01%" = top0.1_hhweal) %>%
  pivot_longer(!year, names_to = "group",".value", values_to = "value")

df_f10 %>% 
  select(year, "Global Billionaire Wealth" = bn_hhweal, "Top 0.01%" = top0.1_hhweal) %>%
  pivot_longer(!year, names_to = "group",".value", values_to = "value") %>%
  ggplot() +
  stat_smooth(aes(x = year, y = value, color = group), formula = y~x, method = "loess", span = 0.25, se = FALSE)

4.1.19 F6: Global income inequality: Between vs. Within country inequality (Theil index), 1820-2020 - pivot + area

df_f6 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F6"); df_f6

df_f6 %>% select(year = "...1", 2:3) %>%
  pivot_longer(cols = 2:3, names_to = "type", values_to = "value") %>%
  mutate(types = factor(type, 
      levels = c("Within-country inequality", "Between-country inequality"))) %>%
  ggplot(aes(x = year, y = value, fill = types)) +
  geom_area() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = round(seq(1820, 2020, by = 20),1)) + 
  scale_fill_manual(values=rev(scales::hue_pal()(2)), 
      labels = function(x) str_wrap(x, width = 15)) +
  labs(title = "Figure 6. Global income inequality: 
       \nBetween vs. within country inequality (Theil index), 1820-2020",
       x = "", y = "Share of global inequality (% of total Theil index)", fill = "") + 
  annotate("text", x = 1850, y = 0.28, 
      label = stringr::str_wrap("1820: Between country inequality represents 11% 
                                of global inequality", width = 20), size = 3) + 
  annotate("text", x = 1980, y = 0.70, 
      label = stringr::str_wrap("1980: Between country inequality represents 57% 
                                of global inequality", width = 20), size = 3) +
  annotate("text", x = 1990, y = 0.30, 
      label = stringr::str_wrap("2020: Between country inequality represents 32% 
                                of global inequality", width = 20), size = 3)


4.1.20 F11: Top 1% vs bottom 50% wealth shares in Western Europe and the US, 1910-2020 - pivot name_sep + fit curve

df_f11 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F11"); df_f11

df_f11 %>% 
  rename(!year, US_bot50 = USbot50, US_top1 = UStop1, 
         EU_bot50 = EUbot50, EU_top1 = EUtop1) %>%
  pivot_longer(!year, names_to = c("group",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") %>%
  ggplot() +
  stat_smooth(aes(x = year, y = value, color = group, linetype = type), 
              span = 0.25, se = FALSE) +
  scale_x_continuous(breaks = round(seq(1910, 2020, by = 10),1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 11. Top 1% vs bottom 50% wealth shares 
       \n in Western Europe and the US, 1910-2020", 
       x = "", y = "Share of total personal wealth (%)", color = "", linetype = "") +
  scale_linetype_manual(values = c("dotted","solid")) +
  annotate("text", x = 2000, y = 0.50, 
      label = stringr::str_wrap("Wealth inequality has been rising at 
        different speeds after a historical decline. The bottom 50% has always been 
                                extremely low.", width = 30), size = 3)

4.1.20.1 Step 1.

df_f11 %>% rename(!year, US_bot50 = USbot50, US_top1 = UStop1, 
                  EU_bot50 = EUbot50, EU_top1 = EUtop1) 

4.1.20.2 Step 2.

df_f11 %>% 
  rename(!year, US_bot50 = USbot50, US_top1 = UStop1, 
         EU_bot50 = EUbot50, EU_top1 = EUtop1) %>%
  pivot_longer(!year, names_to = c("group",".value"), names_sep = "_")

4.1.20.3 Step 2.


4.1.20.4 Step 3.

df_f11 %>% 
  rename(!year, US_bot50 = USbot50, US_top1 = UStop1, 
         EU_bot50 = EUbot50, EU_top1 = EUtop1) %>%
  pivot_longer(!year, names_to = c("group",".value"), 
               names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") 

4.1.20.5 Step 3.



4.1.21 F8: The rise of private versus the decline of public wealth in rich countries, 1970-2020 - rename + pivot + pivot + fit curve

df_f8 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F8"); df_f8

df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") %>%
  ggplot() +
  stat_smooth(aes(x = year, y = value, color = country, linetype = type), 
              span = 0.25, se = FALSE, size=0.75) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 8. The rise of private versus the decline of public 
       wealth in rich countries, 1970-2020", 
       x = "", y = "wealth as as % of national income", color = "", type = "")

4.1.21.1 Step 1

df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') 


4.1.21.2 Step 2.

df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") 


4.1.21.3 Step 3.

df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value")


4.1.21.4 Step 3. Final Step

df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") %>%
  ggplot() +
  stat_smooth(aes(x = year, y = value, color = country, linetype = type), 
              formula = y~x, method = "loess", span = 0.25, se = FALSE, size=0.75) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 8. The rise of private versus the decline of public wealth 
       \nin rich countries, 1970-2020", 
       x = "", y = "wealth as as % of national income", color = "", type = "")


4.1.22 F15: Per capita emissions acriss the world, 2019 - add row names + dodge

df_f15 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F15"); df_f15

df_f15 %>% mutate(region = rep(regionWID[!is.na(regionWID)], each = 3)) %>%
  select(region, group, tcap) %>%
  ggplot(aes(x = region, y = tcap, fill = group)) +
  geom_col(position = "dodge") + 
  scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 10)) +
  labs(title = "Figure 15 Per capita emissions across the world, 2019", 
       x = "", y = "tonnes of CO2e per person per year", fill = "")

4.2 EDA Workflow

4.2.1 EDA Step 0

  1. Choose and clarify a topic to study.
  2. List questions to study
  3. Find data:
  • link to data with a url: universal resource locator in a webpage
  • download data in csv, Excel, etc.

Repeat the process during your EDA.

image

4.2.2 EDA by R Studio: Step 1

In RStudio,

1.1. Project

  • Create a new project: File > New Project; or
  • Open a project: File > Open Project, Open Project in New Session, Open Recent Project
  • Check there is a file project_name.Rproj in your project folder (directory)

1.2. data folder (directory) data

  • Create a data folder: Press New Folder at the right bottom pane; or
  • Confirm the data folder previously created: Press Files at the right bottom pane
  • If you follow 1, the data folder exists in your project folder

1.3. Move (or copy) data for the project to the data folder

  • If you downloaded the data, it is in your Download folder. Move it to data.
  • Check in your RStudio that your data is in data: Press Files at the right bottom pane and click data, the data folder.

4.2.3 EDA by R Studio: Step 2

2.1. Project Notebook: Memo

  • Create an R Notebook: File > New File > R Notebook

    • You can use R Notebook template in Moodle by moving the template (template.Rmd or template.nb.Rmd) file in your project folder or copy and paste the text file into your new R Notebook.
    • If you use template.nb.Rmd (R Notebook File), choose Open in Editor.
  • Add descriptive title.

2.2. Setup Code Chunk

  • Create a code chunk and add packages to use in the project and RUN the code.

    • library(tidyverse)
    • library(WDI)
    • or any other packages

2.3. Choose Source or Visual editor mode, and start editing Project Notebook

2.4. Edit a new file by saving as for a report

  • File > Save As…

4.2.4 EDA by R Studio: Step 3 - Importing Data

Assign a name you can recall easily when you import data. You may need to reload the data with options.

3.1. Use a package:

  • WDI, wir, eurostat, etc/
  • `wdi_shortname <- WDI(indicator = “indicator’s name”, … )
  • Store the data and use it: write_csv(wdi_shortname, "data/wdi_shortname.csv")
  • wdi_shortname <- read_csv("data/wdi_shortname.csv")

3.2. Use readr to read from data, your data folder

  • df1_shortname <- read_csv("data/file_name.csv")

3.3. Use readr to read using the url of the data

  • df2_shortname <- read_csv("url_of_the_data")
  • Store the data and use it: write_csv(df2_shortname, "data/df2_shortname.csv")
  • df2_shortname <- read_csv("data/df2_shortname.csv")

3.5. Use readxl to read Excel data. Add library(readxl) in the setup and run.

  • df4 <- read_excel("data/file_name.xlsx", sheet = 1)

References: Cheat Sheet - readr, readr, readxl


4.2.5 EDA by R Studio: Step 4 - Data Trasnformation

4.1. Look at the data: suppose df is the data frame

  • It is a good option to change into a tibble: dt <- as_tibble(df)
  • head(df), str(df), summary(df), dt, glimpse(dt)

4.2. Look at each variable

  • categorical? numerical?
  • factor? - forcats

4.3. Variation of each data: suppose x1 is a column name.

  • df %>% ggplot() + geom_histogram(aes(x1), bins = 30)

  • df %>% drop_na(x1): see the rows with a value in x1. If the value is NA, the row is not shown.

    • df_wo_na <- df %>% drop_na(x1) if you want to use only the rows without NA in x1

4.4. Use dpylr and tidyr to change column names, tidy data, and/or summarize data

  • rename, select, filter, arrange, mutate, pivot_longer(), pivot_wider(), group_by and summarize

References: Cheat Sheet - dplyr and tidyr, dplyr, tidyr


4.2.6 EDA by R Studio: Step 5 - Visualize Data

5.1. In combination with Stap 4 - data transformation, try various data visualization.

  • What type of variation occurs within my variables?
  • What type of covariation occurs between my variables?

5.2. Keep a record of what you can observe by the visualization

5.3. Edit the list of questions by adding or polishing

5.4. Select several informative chart and add options

5.5. Look at examples from the textbooks or teaching site to have better visualization

References: Cheat Sheet - ggplot2 ggplot2, ggplot2 book


4.2.7 EDA by R Studio: Step 6 - Conclusions and Questions for Further Study

  1. EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

  2. Generate questions about your data

  3. Search for answers by visualising, transforming, and/or modeling your data

Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.


4.2.8 Example: WDI


4.2.9 Example: WIR2022

4.3 The Week Five Assignment (in Moodle)

tidyr and WIR2022

  • Create an R Notebook of a Data Analysis containing the following and submit the rendered HTML file (eg. a3_123456.nb.html by replacing 123456 with your ID)
    1. create an R Notebook using the R Notebook Template in Moodle, save as a3_123456.Rmd,
    2. write your name and ID and the contents,
    3. run each code block,
    4. preview to create a3_123456.nb.html,
    5. submit a3_123456.nb.html to Moodle.
  1. Choose a data with at least two categorical variables and at least two numerical variables.

    • Information of the data: Name, Indicator, Description, Source, etc.
    • Explain why you chose the indicator
    • List questions you want to study

  1. Explore the data using visualization using ggplot2

    • Create various charts
    • Create at least one chart with at least two categorical variables and at least one numerical variable.
    • Create at least one chart with at least two numerical variables and at least one categorical variable.
  2. Observations based on your data visualization, and difficulties and questions encountered if any.

Due: 2023-01-23 23:59:00. Submit your R Notebook file in Moodle (The Fourth Assignment). Due on Monday!

5 Exploratory Data Analysis (EDA) V

5.1 Modeling

“Tidy data sets are all alike; but every messy data set is messy in its own way.” — Hadley Wickham

“all happy families are all alike; each unhappy family is unhappy in its own way” - Tolstoy’s Anna Karenina

Correlation

iris %>% select(-5) %>% cor()
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

Galton’s data. Regression.

Normalization, standalization

SDGs Academy

Trend

df4 <- WDI(
  country="all",
  indicator = c("SP.POP.TOTL","MS.MIL.XPND.GD.ZS","MS.MIL.TOTL.P1","NY.GDP.MKTP.CD"),
  start = 1960,
  end = NULL,
  extra = FALSE,
  cache = NULL,
  latest = NULL,
  language = "en"
)
colnames(df4) <- c("Country","iso2c","iso3c","Year","Population","Mil_Exp","Total_Mil_HC","GDP")
df4

5.1.1 Tali

df_wdi_poverty <- WDI(
  country = "all", 
  indicator = c(national_poverty_rate = "SI.POV.NAHC", multidimentional_poverty_rate = "SI.POV.MDIM", gdpPercap = "NY.GDP.PCAP.KD", gini_indx = "SI.POV.GINI"), start = 1990,
    end = 2021,
  extra = TRUE
)
df_wdi_poverty %>%  
  group_by(country, year) %>%
  mutate(mean_gdp_country = mean(gdpPercap)) %>%
  mutate(mean_poverty_country= mean(national_poverty_rate)) %>%
  ungroup() %>%
    filter(!is.na(income)) %>%   filter(!(income=="Aggregates")) %>% ggplot(aes(x = log10(mean_gdp_country))) + geom_point(aes(y = mean_poverty_country, color = income)) +  labs(x = "GDP per capita", y = "poverty rate (% of population)", title = "Poverty rates and GDP per capita", subtitle="world countries, 1990-2021 average, by income level")
df_wdi_poverty %>%  
  group_by(country, year) %>%
  mutate(mean_gdp_country = mean(gdpPercap)) %>%
  mutate(mean_multipoverty_country= mean(multidimentional_poverty_rate)) %>%
  ungroup() %>%
    filter(!(region=="Aggregates")) %>% filter(!is.na(region)) %>% ggplot(aes(x = log10(mean_gdp_country))) + geom_point(aes(y = mean_multipoverty_country, color = region)) +  labs(x = "GDP per capita", y = "Multidimentinal poverty rate (% of population)", title = "Multidimentional Poverty rates and GDP per capita", subtitle="world countries, 1990-2021 average, by region")

index sdg, ghi, academy

moocs

R4DS: Model basics https://r4ds.had.co.nz/model-basics.html modelr: https://modelr.tidyverse.org

Tidymodels: https://www.tidymodels.org

Tidyverse Skills for Data Science https://jhudatascience.org/tidyversecourse/ https://jhudatascience.org/tidyversecourse/model.html

machine learning:A Gentle Introduction to tidymodels https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/

Tidy Modeling with R https://www.tmwr.org

5.1.2 Other Examples

5.2 Roudups

5.2.1 R Markdown Revisited

5.2.1.1 Literate Programming and Reproducible Research

Importing Data:

  1. Read a csv file: read_csv("data/file_name.csv")
  2. Download and import using a url of a csv file: read_csv(url)
  3. Read an Excel file: readxl::read_excel("data/excel_file_name.xlsx")
  4. Read from the clipboard: read_delim(clipboard())
  • Not reproducible unless clearly explained.

5.2.1.2 Code Chunk Options

https://yihui.org/knitr/options/

  • Chunk Name
  • Output: use document default
    • Show code and output: echo=TRUE, eval=TRUE - Default
    • Show output only: echo=FALSE
    • Show nothing (run code): include=FALSE
    • Show nothing (don’t run code): include=FALSE, eval=FALSE
  • Show message: message=TRUE, FALSE
  • Show warning: warning=TRUE, FALSE
  • Use Paged Tables: paged.print=TRUE, FALSE
  • Use custom figure size: width and height in inch.

5.2.1.3 Presentation and Paper

  1. Data Source

  2. Variables

  3. Problems

  4. Visualization

  5. Model

  6. Conclusions and Further Research

    WDI, WIR, etc

5.2.1.4 Word

Custom Word templates: https://bookdown.org/yihui/rmarkdown-cookbook/word-template.html

You can apply the styles defined in a Word template document to new Word documents generated from R Markdown. Such a template document is also called a “style reference document.” The key is that you have to create this template document from Pandoc first, and change the style definitions in it later. Then pass the path of this template to the reference_docx option of word_document

---
 word_document:
    reference_docx: "template.docx"
---

5.2.1.5 PowerPoint

PowerPoint presentation: https://bookdown.org/yihui/rmarkdown/powerpoint-presentation.html

Custom templates: https://bookdown.org/yihui/rmarkdown/powerpoint-presentation.html#ppt-templates

---
  powerpoint_presentation:
    reference_doc: my-styles.pptx
---

https://support.microsoft.com/en-us/office/create-and-save-a-powerpoint-template-ee4429ad-2a74-4100-82f7-50f8169c8aca

YouTube: How To Create A PowerPoint Template

---
title: 'QALL401: Data Analysis for Researchers'
output:
  html_notebook:
    number_sections: yes
  html_document:
    df_print: paged
    number_sections: yes
  beamer_presentation: default
  pdf_document:
    number_sections: yes
  ioslides_presentation: 
    df_print: paged
    widescreen: yes
    smaller: yes
---

## Course Contents {-}

1. 2022.12.07: Introduction: About the course [lead by TK]
    - An introduction to open and public data, and data science
2. 2022-12-14: Exploratory Data Analysis (EDA) 1 [lead by hs]  
    - R Basics with RStudio and/or RStudio.cloud; Toy Data
3. 2022-12-21: Exploratory Data Analysis (EDA) 2 [lead by hs]   
    - R Markdown, `tidyverse` I: `dplyr`; `gapminder`
4. 2023-01-11: Exploratory Data Analysis (EDA) 3 [lead by hs]  
    - `tidyverse`II: `readr`, `ggplot2`; Public Data, WDI, WIR, etc
5. 2023-01-18: Exploratory Data Analysis (EDA) 4 [lead by hs]  
    - `tidyverse` III: `tidyr`, etc.; WDI, WIR, etc
6. 2023-01-25: Exploratory Data Analysis (EDA) 5 [lead by hs]  
    - `tidyverse` IV; WDI, WIR, etc
7. 2023-02-01: Introduction to PPDAC         
    - Problem-Plan-Data-Analysis-Conclusion Cycle: [lead by TK]
8. 2023-02-08: Model building I [lead by TK]    
    - Collecting and visualizing data and Introduction to WDI  
         (World Development Indicators by World Bank)
9. 2023-02-15: Model building II [lead by TK]    
    - Analyzing data and communications
10. 2023-02-22: Project Presentation


# Exploratory Data Analysis (EDA) I

## R with R Studio and/or R Studio.cloud

---

### Learning Resources

#### Textbooks and References

* "R for Data Science" by Hadley Wickham and Garrett Grolemund: 
  - Free Online Book: https://r4ds.had.co.nz

* Visit `bookdown` site: https://bookdown.org 
  - Many more on the [archive page](https://bookdown.org/home/archive/).


### Interactive Exercises

* Posit Primers:https://posit.cloud/learn/primers:  
  - The Basics, Work with Data, Visualize Data, Tidy Your Data, Report Reproducibly

* {swirl} Learn R, in R: https://swirlstats.com
  - Designed and developed by a team at Johns Hopkins University for `coursera` courses

---

### Posit Primers created by `learnr`

* [`learnr` Interactive Tutorials for R](https://rstudio.github.io/learnr/index.html)

::: {.block}
#### Posit Primers https://posit.cloud/learn/primers

1. The Basics -- [r4ds: Explore, I](https://r4ds.had.co.nz/explore-intro.html#explore-intro)
  - [Visualization Basics](https://rstudio.cloud/learn/primers/1.1)
  - [Programming Basics](https://rstudio.cloud/learn/primers/1.2)
2. Work with Data -- [r4ds: Wrangle, I](https://r4ds.had.co.nz/wrangle-intro.html#wrangle-intro)
  - Working with Tibbles
  - Isolating Data with dplyr
  - Deriving Information with dplyr
3. Visualize Data -- [r4ds: Explore, II](https://r4ds.had.co.nz/explore-intro.html#explore-intro)
4. Tidy Your Data -- [r4ds: Wrangle, II](https://r4ds.had.co.nz/wrangle-intro.html#wrangle-intro)
5. Iterate -- [r4ds: Program](https://r4ds.had.co.nz/program-intro.html#program-intro)
6. Write Functions -- [r4ds: Program](https://r4ds.had.co.nz/program-intro.html#program-intro)
:::
---

### Data Science and EDA

#### Wikipedia https://en.wikipedia.org/wiki/Data_science

> An inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data.

* Create Insights
* Impact Decision Making
* Maintain & Improve Overtime

---

### What is R?

#### R (programming language), [Wikipedia](https://en.wikipedia.org/wiki/R_(programming_language))

* **R is a programming language** and **free software** environment for **statistical computing and graphics** supported by the R Foundation for Statistical Computing. 

* The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

* A **GNU package**, the official R software environment is written primarily in C, Fortran, and R itself (thus, it is partially self-hosting) and is freely available under the GNU General Public License. 

---

#### History of R and more

"R Programming for Data Science" by Roger Peng

* [Chapter 2. History and Overview of R](https://bookdown.org/rdpeng/rprogdatascience/history-and-overview-of-r.html)
* [Overview and History of R: Youtube video](https://www.youtube.com/watch?v=STihTnVSZnI&feature=youtu.be)

---

### Why R? -- Responses by Hadley Wickham

#### [r4ds](https://r4ds.had.co.nz/introduction.html#python-julia-and-friends): R is a great place to start your data science journey because

* R is an environment designed from the ground up to support data science. 
* R is not just a programming language, but it is also an interactive environment for doing data science. 
* To support interaction, R is a much more flexible language than many of its peers. 

---

#### Why R today?

When you talk about choosing programming languages, I always say you shouldn’t pick them based on technical merits, but rather pick them based on the community. And I think the R community is like really, really strong, vibrant, free, welcoming, and embraces a wide range of domains. So, if there are like people like you using R, then your life is going to be much easier. That’s the first reason. 

**Interview**: ["Advice to Young (and Old) Programmers, H. Wickham"](https://www.r-bloggers.com/2018/08/advice-to-young-and-old-programmers-a-conversation-with-hadley-wickham/)

---

### What is RStudio? https://posit.com

> RStudio is an integrated development environment, or IDE, for R programming. 

#### R Studio (Wikipedia)

RStudio is an integrated development environment (IDE) for R, a programming language for statistical computing and graphics. It is available in two formats: RStudio Desktop is a regular desktop application while RStudio Server runs on a remote server and allows accessing RStudio using a web browser.

---

### Installation of R and R Studio

#### R Installation

To download R, go to CRAN, the comprehensive R archive network. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. Don’t try and pick a mirror that’s close to you: instead use the cloud mirror, https://cloud.r-project.org, which automatically figures it out for you.

A new major version of R comes out once a year, and there are 2-3 minor releases each year. It’s a good idea to update regularly.

#### R Studio Installation

Download and install it from http://www.rstudio.com/download. 

RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know.

---

### R Studio

#### The First Step
1. Start R Studio Application
2. Top Menu: File > New Project > New Directory > New Project > _Directory name or Browse the directory and choose the parent directory you want to create the directory_

#### When You Start the Project
1. Go to the directory you created
2. Double click _'Directory Name'.Rproj
  
Or,

1. Start R Studio
2. File > Open Project (or choose from Recent Project)

_In this way the working directory of the session is set to the project directory and R can search releted files without difficulty_ (`getwd()`, `setwd()`)

---

### Posit Cloud

RStudio Cloud is a lightweight, cloud-based solution that allows anyone to do, share, teach and learn data science online.

#### Cloud Free

* Up to 15 projects total
* 1 shared space (5 members and 10 projects max)
* 15 project hours per month
* Up to 1 GB RAM per project
* Up to 1 CPU per project
* Up to 1 hour background execution time

---

#### How to Start Posit Cloud

1. Go to https://posit.cloud/
2. Sign Up: _top right_
  - Email address or Google account
3. New Project: _Project Name_
4. R Console



## Let's Get Started

Start RStudio and create a project, or login to Posit Cloud and create a project.

---

### The First Examples

Input the following codes into Console in the left bottom pane.

* The first two:

```{r}
head(cars)
```

---

```{r}
str(cars)
```

---

* Two more:

```{r}
summary(cars)
```

---

```{r eval=FALSE}
plot(cars)
```

```{r cars-plot, echo=FALSE}
plot(cars)
```

---

* And three more:

```{r eval=FALSE}
plot(cars) # cars: Speed and Stopping Distances of Cars
abline(lm(cars$dist~cars$speed))
```
```{r echo=FALSE}
plot(cars) # cars: Speed and Stopping Distances of Cars
abline(lm(cars$dist~cars$speed))
```

---

```{r}
lm(cars$dist~cars$speed)
```

---

```{r}
summary(lm(cars$dist~cars$speed))
```

---

#### Brief Explanation

* `head(cars)`: The first 6 rows of the pre-installed data `cars`.
* `str(cars)`: The data structure of the pre-installed data `cars`.
* `summary(cars)`: The summary of the pre-installed data `cars`.
* `plot(cars)`: A scatter plot of the pre-installed data `cars`.
  - `plot(cars$dist~cars$speed)`
  - `cars$dist`, `cars$[[2]]`, `cars[,2]` are same
* `abline(lm(cars$dist~cars$speed))`: Add a regression line of a linear model
* `lm(cars$dist~cars$speed)`: The equation of the regression line
* `summary(lm(cars$dist~cars$speed)`: The summary of the linear regression model

---

```{r, eval=FALSE}
hist(cars$dist)
```
```{r, echo=FALSE}
hist(cars$dist)
```

---

```{r, eval=FALSE}
hist(cars$speed)
```
```{r, echo=FALSE}
hist(cars$speed)
```

---

#### View and help

* `View(cars)`
* `?cars`: same as `help(cars)`
* `??cars`: same as `help.search("cars")

#### `datasets`

* `?datasets`
* `library(help = "datasets")`

* `data()` shows all data already attached and available.

---

### Practicum

Pick a data in the datasets package and try

* `head()`
* `str()`
* `summary()`

and some more.

---

#### `iris`

```{r}
head(iris)
```

---

```{r}
str(iris)
```

---

```{r}
summary(iris)
```

---

Can you plot?

```{r eval=FALSE}
plot(iris$Sepal.Length, iris$Sepal.Width)
```
```{r echo=FALSE}
plot(iris$Sepal.Length, iris$Sepal.Width)
```

## `tidyverse` Packages

### Brief Introduction to R on RStudio

#### Four Panes and Tabs

1. Top Left: Source Editor
2. Top Right: Environment, History, etc.
3. Bottom Left: Console, Terminal, Render, Background Jobs
4. Bottom Right: Files, Plots, Packages, Help, Viewer, Presentation

---

#### Set up

* Highly recommend to set the language to be "English".
* Create "data" directory.

```{r warning=FALSE, eval=FALSE}
Sys.setenv(LANG = "en")
dir.create("data")
```

---

#### Three Ways to Run Codes

1. Console - Bottom Left Pane
2. R Script - pull down menu under File
3. R Notebook, R Markdown - pull down menu under File

---

### Second Way: R Script

#### Examples: R Scripts in Moodle

* `basics.R`
* `coronavirus.R`

1. Copy a script in Moodle: _{file name}.R_
2. In RStudio (create Project in RStudio) choose File > New File > R Script and paste it.
3. Choose File > Save with a name; e.g. _{file names}_ (.R will be added automatically)

To run a code: at the cursor press *Ctrl+Shift+Enter* (Win) or *Cmd+Shift+Enter* (Mac). 

---

### Packages

R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R, typically via a centralised software repository such as CRAN (the Comprehensive R Archive Network).

#### Installation and attachement

You can install packages by "Install Packages..." under "Tool" in the top menu.

* `install.packages("tidyverse")`
* `install.packages("rmarkdown")`

---

### Third Way: R Notebook

Choose R Notebook from the pull down File menu in the top bar.

### Edit YAML

**Default* is as follows**

```
---
title: "R Notebook"
output: html_notebook
---
```

---

**Template**

```
---
title: "Title of R Notebook"
author: "ID and Your Name"
date: "`r Sys.Date()`" 
output: 
  html_notebook:
#    number_sections: yes
#    toc: true
#    toc_float: true
---
```

* Don't change the format. Indention matters.
* The statement after \# is ignored.
* Date is automatically inserted when you compile the file.
* You can replace "`r Sys.Date()`" by "2022-12-14" or in any date format surrounded by double quotation marks.
* Section numbers: - default is `number_sections: no`.
* Table of contents, `toc: true` - default is `toc: false`.
* Floating table of contents in HTML output, `toc_float: true` - default is `toc_float: false`

---

### Create a Code Chunk to Attach Packages

Insert Chunk in Code pull down menu in the top bar, or use the <kbd>C</kbd> button on top. You can use shortcut keys listed under Tools in the top bar.

```{r}
library(tidyverse)
```


## First Example

### Importing data

Let us assign the `iris` data in the pre-installed package `datasets` to `df_iris`. You can give any name starting from an alphabet, though there are some rules. 

```{r}
df_iris <- datasets::iris
class(df_iris)
```

The class of data `iris` is `data.frame`, the basic data class of R. You can assign the same data as a `tibble`, the data class of `tidyverse` as follows.

```{r}
tbl_iris <- as_tibble(datasets::iris)
class(tbl_iris)
```

* `df_iris <- iris` can replace  `df_iris <- datasets::iris` because the package `datasets` is installed and attached as default. Since you may have other data called `iris` included in a different package or you may have changed `iris` before, it is safer to specify the name of the package with the name of the data.
* Within R Notebook or in Console, you may get different output, and `tf_iris` and `tbl_iris` behave differently. It is because of the default settings of R Markdown. 

---

### Look at the data

#### Several ways to view the data.

The `View` command open up a window to show the contents of the data and you can use the filter as well.

```{r viewiris, eval = FALSE}
View(df_iris)
```

---

The following simple command also shows the data. 
```{r dfiris}
df_iris
```

The output within R Notebook is a tibble style. Try the same command in Console.

---

```{r slice10iris}
slice(df_iris, 1:10)
```


```{r one2ten}
1:10
```
`
---

#### Data Structure

Let us look at the structure of the data. You can try `str(df_iris)` on Console or by adding a code chunk in R Notebook introducing later.

```{r glimpse:eiris}
glimpse(df_iris)
```

There are six types of data in R; Double, Integer, Character, Logical, Raw, Complex.

The names after $ are column names. If you call `df_iris$Species`, you have the Species column. Species is in the 5th collumn, `typeof(df_iris[[5]])` does the same as the next. 

`df_iris[2,4] = ``r df_iris[2,4]` is the fourth entry of Sepal.Width.

---

```{r}
typeof(df_iris$Species)
```

```{r}
class(df_iris$Species)
```

For `factors = fct` see [the R Document](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/factor) or an explanation in [Factor in R: Categorical Variable & Continuous Variables](https://www.guru99.com/r-factor-categorical-continuous.html).

---

```{r}
typeof(df_iris$Sepal.Length)
class(df_iris$Sepal.Length)
```


**Q1.** What are the differences of`df_iris`, `slice(df_iris, 1:10)` and `glimpse(df_iris)` above?

**Q2.** What are the differences of`df_iris`, `slice(df_iris, 1:10)` and `glimpse(df_iris)` in the console?

---

#### Summary of the Data

The following is very convenient to get the summary information of a data.

```{r}
summary(df_iris)
```

Minimum, 1st Quadrant (25%),  Median, Mean, 3rd Quadrant (75%), Maximum, and the count of each factor.

---

### Visualizing Data

#### Scatter Plot

We use `ggplot` to draw graphs. The scatter plot is a projection of data with two variables $x$ and $y$. 
```
ggplot(data = <data>, aes(x = <column name for x>, y = <column name for y>)) +
  geom_point()
```
```
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
```

---


```{r message=FALSE, warning=FALSE, comment=FALSE}
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
```

---

#### Scatter Plot with [Labels](https://ggplot2.tidyverse.org/reference/labs.html)

Add title and labels adding `labs()`. 

```
ggplot(data = <data>, aes(x = <column name for x>, y = <column name for y>)) +
  geom_point() +
  labs(title = "Title", x = "Label for x", y = "Label for y")
```
---

```{r}
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() + 
  labs(title = "Scatter Plot of Sepal Data of Iris", x = "Sepal Length", y = "Sepal Width")
```

---

#### Scatter Plot with [Colors](https://ggplot2.tidyverse.org/reference/aes_colour_fill_alpha.html)

Add different colors automatically to each species. Can you see each group?

```{r}
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()
```

---

#### Scatter Plot with Shapes

```{r}
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width, shape = Species)) +
  geom_point()
```

---

#### [Boxplot](https://ggplot2.tidyverse.org/reference/geom_boxplot.html)

The boxplot compactly displays the distribution of a continuous variable. 

```{r}
ggplot(data = df_iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()
```

---

#### [Histogram](https://ggplot2.tidyverse.org/reference/geom_histogram.html)

Visualize the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin. Histograms (geom_histogram()) display the counts with bars. 

```{r, message=FALSE, warning=FALSE}
ggplot(data = df_iris, aes(x = Sepal.Length)) +
  geom_histogram()
```

---

Change the number of bins by `bins =` `<number>`.

```{r}
ggplot(data = df_iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 10)
```

---

### Data Modeling 

Professor Kaizoji will cover the mathematical models and hypothesis testings. 

```{r, message=FALSE, warning=FALSE}
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

## Comments on Week 2

#### Helpful Resources

* Cheat Sheet in RStudio: https://www.rstudio.com/resources/cheatsheets/  
  - [RStudio IED](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rstudio-ide.pdf)
  - [Base R Cheat Sheet](https://github.com/rstudio/cheatsheets/raw/main/base-r.pdf)
* 'Quick R' by DataCamp: https://www.statmethods.net/management

* [An Introduction to R](https://cran.rstudio.com)


:::{.alertblock}
#### Practicum

* Posit Primers: The Basics: https://posit.cloud/learn/primers/1
  - Complete Visualization Basics and Programming Basics

---

#### Assignments - See Moodle

1. Assignment Week 2-1: Introduction Plus Forum  
  - Due: Tuesday, 20 December 2022, 11:59 PM

2. Assignment Week 2-2: Quiz 1 on R Basics 
  - Due: Tuesday, 20 December 2022, 11:59 PM

:::


## Swirl: An interactive learning environment for R and statistics

* {`swirl`} website: https://swirlstats.com
* JHU Data Science in coursera uses `swirl` for exercises.

### Swirl Courses

1. R Programming: The basics of programming in R
2. Regression Models: The basics of regression modeling in R
3. Statistical Inference: The basics of statistical inference in R
4. Exploratory Data Analysis: The basics of exploring data in R

You can install other `swirl` courses as well

* [Swirl Courses Organized by Title](http://swirlstats.com/scn/title.html)
* [Swirl Courses Organized by Author’s Name](http://swirlstats.com/scn/surname.html)
* [Github: swirl courses](https://github.com/swirldev/swirl_courses#swirl-courses) 
  - `install_course("Course Name Here")`

---

### Install and Start Swirl Courses

#### Three Steps to Start Swirl

```
install.packages("swirl") # Only the first time.
library(swirl) # Everytime you start swirl
swirl() # Everytime you start or resume swirl
```

#### R Programming: The basics of programming in R

\scriptsize
```
 1: Basic Building Blocks      2: Workspace and Files     3: Sequences of Numbers    
 4: Vectors                    5: Missing Values          6: Subsetting Vectors      
 7: Matrices and Data Frames   8: Logic                   9: Functions               
10: lapply and sapply         11: vapply and tapply      12: Looking at Data         
13: Simulation                14: Dates and Times        15: Base Graphics          
```


#### Recommended Sections in Order

1, 3, 4, 5, 6, 7, 12, 15, 14, 8, 9, 10, 11, 13, 2

* Section 2 discusses the directories and file systems of a computer
* Sections 9, 10, 11 are for programming

---

#### Controling a `swirl` Session

* ...  <-- That's your cue to press Enter to continue

  
* You can exit swirl and return to the R prompt (>) at any time by pressing the Esc key.

* If you are already at the prompt, type bye() to exit and save your progress. When you exit properly, you'll see a short message letting you know you've done so.

When you are at the R prompt (>):

1. Typing skip() allows you to skip the current question.
2. Typing play() lets you experiment with R on your own; swirl will ignore what you do...
3. UNTIL you type nxt() which will regain swirl's attention.
4. Typing bye() causes swirl to exit. Your progress will be saved.
5. Typing main() returns you to swirl's main menu.
6. Typing info() displays these options again.

---

#### Final Remark

You will encounter the message like ‘Would you like to receive credit for completing this course on Coursera.org?’ at the end of each course. This is for `coursera` courses. Select 'NO'. 



## More on R Script: Examples

### R Scripts in Moodle

* basics.R
* coronavirus.R

1. Copy a script in Moodle: _{file name}.R_
2. In RStudio (Workspace in RStudio.cloud, Project in RStudio) choose File > New File > R Script and paste it.
3. Choose File > Save with a name; e.g. _{file names}_ (.R will be added automatically)

---

### `basics.R`

The script with the outputs.

```{r basics, eval=FALSE}
#################
#
# basics.R
#
################
# 'Quick R' by DataCamp may be a handy reference: 
#     https://www.statmethods.net/management/index.html
# Cheat Sheet at RStudio: https://www.rstudio.com/resources/cheatsheets/
# Base R Cheat Sheet: https://github.com/rstudio/cheatsheets/raw/main/base-r.pdf
# To execute the line: Control + Enter (Window and Linux), Command + Enter (Mac)
## try your experiments on the console

## calculator

3 + 7

### +, -, *, /, ^ (or **), %%, %/%

3 + 10 / 2

3^2

2^3

2*2*2

### assignment: <-, (=, ->, assign()) 

x <- 5

x 

#### object_name <- value, '<-' shortcut: Alt (option) + '-' (hyphen or minus) 
#### Object names must start with a letter and can only contain letter, numbers, _ and .

this_is_a_long_name <- 5^3

this_is_a_long_name

char_name <- "What is your name?"

char_name

#### Use 'tab completion' and 'up arrow'

### ls(): list of all assignments

ls()
ls.str()

#### check Environment in the upper right pane

### (atomic) vectors

5:10

a <- seq(5,10)

a

b <- 5:10

identical(a,b)

seq(5,10,2) # same as seq(from = 5, to = 10, by = 2)

c1 <- seq(0,100, by = 10)

c2 <- seq(0,100, length.out = 10)

c1

c2

length(c1)

#### ? seq   ? length   ? identical

(die <- 1:6)

zero_one <- c(0,1) # same as 0:1

die + zero_one # c(1,2,3,4,5,6) + c(0,1). re-use

d1 <- rep(1:3,2) # repeat


d1

die == d1

d2 <- as.character(die == d1)

d2

d3 <- as.numeric(die == d1)

d3

### class() for class and typeof() for mode
### class of vectors: numeric, charcters, logical
### types of vectors: doubles, integers, characters, logicals (complex and raw)

typeof(d1); class(d1)

typeof(d2); class(d2)

typeof(d3); class(d3)

sqrt(2)

sqrt(2)^2

sqrt(2)^2 - 2

typeof(sqrt(2))

typeof(2)

typeof(2L)

5 == c(5)

length(5)

### Subsetting

(A_Z <- LETTERS)

A_F <- A_Z[1:6]

A_F

A_F[3]

A_F[c(3,5)]

large <- die > 3

large

even <- die %in% c(2,4,6)

even

A_F[large]

A_F[even]

A_F[die < 4]

### Compare df with df1 <- data.frame(number = die, alphabet = A_F)
df <- data.frame(number = die, alphabet = A_F, stringsAsFactors = FALSE)

df

df$number

df$alphabet

df[3,2]

df[4,1]

df[1]

class(df[1])

class(df[[1]])

identical(df[[1]], die)

identical(df[1],die)

####################
# The First Example
####################

plot(cars)

# Help

? cars

# cars is in the 'datasets' package

data()

# help(cars) does the same as ? cars
# You can use Help tab in the right bottom pane

help(plot)
? par

head(cars)

str(cars)

summary(cars)

x <- cars$speed
y <- cars$dist

min(x)
mean(x)
quantile(x)

plot(cars)

abline(lm(cars$dist ~ cars$speed))

summary(lm(cars$dist ~ cars$speed))

boxplot(cars)

hist(cars$speed)
hist(cars$dist)
hist(cars$dist, breaks = seq(0,120, 10))
```

---

### coronavirus.R

The script and its outputs.
__coronavirus.csv__ is very large


```{r cash = TRUE, eval=FALSE}
# https://coronavirus.jhu.edu/map.html
# JHU Covid-19 global time series data
# See R pakage coronavirus at: https://github.com/RamiKrispin/coronavirus
# Data taken from: https://github.com/RamiKrispin/coronavirus/tree/master/csv
# Last Updated
Sys.Date()

## Download and read csv (comma separated value) file
coronavirus <- read.csv("https://github.com/RamiKrispin/coronavirus/raw/master/csv/coronavirus.csv")
# write.csv(coronavirus, "data/coronavirus.csv")

## Summaries and structures of the data
head(coronavirus)
str(coronavirus)
coronavirus$date <- as.Date(coronavirus$date)
str(coronavirus)

range(coronavirus$date)
unique(coronavirus$country)
unique(coronavirus$type)

## Set Country
COUNTRY <- "Japan"
df0 <- coronavirus[coronavirus$country == COUNTRY,]
head(df0)
tail(df0)
(pop <- df0$population[1])
df <- df0[c(1,6,7,13)]
str(df)
head(df)
### alternatively,
head(df0[c("date", "type", "cases", "population")])
###

## Set types
df_confirmed <- df[df$type == "confirmed",]
df_death <- df[df$type == "death",]
df_recovery <- df[df$data_type == "recovery",]
head(df_confirmed)
head(df_death)
head(df_recovery)

## Histogram
plot(df_confirmed$date, df_confirmed$cases, type = "h")
plot(df_death$date, df_death$cases, type = "h")
# plot(df_recovered$date, df_recovered$cases, type = "h") # no data for recovery

## Scatter plot and correlation
plot(df_confirmed$cases, df_death$cases, type = "p")
cor(df_confirmed$cases, df_death$cases)


## In addition set a period
start_date <- as.Date("2021-07-01")
end_date <- Sys.Date() 
df_date <- df[df$date >=start_date & df$date <= end_date,]
##

## Set types
df_date_confirmed <- df_date[df_date$type == "confirmed",]
df_date_death <- df_date[df_date$type == "death",]
df_date_recovery <- df_date[df_date$data_type == "recovery",]
head(df_date_confirmed)
head(df_date_death)
head(df_date_recovery)

## Histogram
plot(df_date_confirmed$date, df_date_confirmed$cases, type = "h")
plot(df_date_death$date, df_date_death$cases, type = "h")
# plot(df_date_recovered$date, df_date_recovered$cases, type = "h") # no data for recovery

plot(df_date_confirmed$cases, df_date_death$cases, type = "p")
cor(df_date_confirmed$cases, df_date_death$cases)

### Q0. Change the values of the location and the period and see the outcomes.
### Q1. What is the correlation between df_confirmed$cases and df_death$cases?
### Q2. Do we have a larger correlation value if we shift the dates to implement the time-lag?
### Q3. Do you have any other questions to explore?

#### Extra
plot(df_confirmed$date, df_confirmed$cases, type = "h", 
     main = paste("Comfirmed Cases in",COUNTRY), 
     xlab = "Date", ylab = "Number of Cases")
```

:::

## `gapminder` Package

### Hans Rosling (1948 – 2017)

> Hans Rosling was a Swedish physician, academic, and public speaker. He was a professor of international health at Karolinska Institute[4] and was the co-founder and chairman of the Gapminder Foundation, which developed the Trendalyzer software system. ([wikipedia](https://en.wikipedia.org/wiki/Hans_Rosling))

* Books: 
  - Factfulness: Ten Reasons We're Wrong About The World - And Why Things Are Better Than You Think, 2018
  - How I Learned to Understand the World: A Memoir, 2020
* Gapminder: https://www.gapminder.org
  - [You are probably wrong about: Upgrade Your World View](https://upgrader.gapminder.org)
  - [Bubble Chart](https://www.gapminder.org/tools/#$state$time$value=2020;;&chart-type=bubbles): Income vs Life Expectancy over time, 1800 - 2020
    + How many variables?
* Videos: [The best stats you’ve ever seen, Hans Rosling](http://www.edtech.events/the-best-stats-youve-ever-seen-hans-rosling/)

---

### Factfulness is ... \hfill _From the book_

recognizing when a decision feels urgent and remembering that it rarely is.

To control the urgency instinct, take small steps.

* Take a breath. When your urgency instinct is triggered, your other instincts kick in and your analysis shuts down. Ask for more time and more information. It’s rarely now or never and it’s rarely either/or.
* Insist on the data. If something is urgent and important, it should be measured. Beware of data that is relevant but inaccurate, or accurate but irrelevant. Only relevant and accurate data is useful.
* Beware of fortune-tellers. Any prediction about the future is uncertain. Be wary of predictions that fail to acknowledge that. Insist on a full range of scenarios, never just the best or worst case. Ask how often such predictions have been right before.
* Be wary of drastic action. Ask what the side effects will be. Ask how the idea has been tested. Step-by-step practical improvements, and evaluation of their impact, are less dramatic but usually more effective.

---

```{r}
# install.packages("gapminder")
library(gapminder)
```
```{r}
df <- gapminder
df
```

---

```{r}
glimpse(df)
```

---

```{r}
summary(df)
```

---

### Questions

* List questions based on this data.
* What do you want to see? 
* What kind of chart do you want to construct?


## Review {-}

* R on R Studio/Posit Cloud (RStudio Cloud)
* Three ways to run codes
  1. Console
  2. R Script
  3. Code Chunk in R Notebook
* Packages
  1. `tidyverse`
  2. `rmarkdown`
  3. `gapminder`
  
```{r include=FALSE}
library(tidyverse)
library(gapminder)
library(WDI)
```


---

#### EDA (A diagram from R4DS by H.W. and G.G.) {-}

![EDA from r4ds](data/data-science.png)

**Today**: R Markdown and `dplyr`

# Exploratory Data Analysis II

## R Markdown

What is R Markdown: https://vimeo.com/178485416

* Code Chunks
* Text
* YAML Metadata

---

### What is R Markdown and R Notebook

R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both

* save and execute code
* generate high quality reports that can be shared with an audience

R Notebooks are an implementation of Literate Programming that allows for direct interaction with R while producing a reproducible document with publication-quality output.

An **R Notebook** is an R Markdown document _with chunks that can be executed independently and interactively, with output visible immediately beneath the input_.

(Reference: [R Markdown: The Definitive Guide, 3.2 Notebook](https://bookdown.org/yihui/rmarkdown/notebook.html))

---

#### Two Goodies

* **Important**: Implementation of Reproducible Research and Literate Programming

* **Useful** to Render into Various Formats: R Notebook (HTML), R Markdown (HTML), PDF, MS Word, MS Powerpoint, Ioslides Presentation (HTML), Slidy Presentation (HTML), Beamer Presentation (PDF), etc.

---

### Reproducible Research and Literate Programming

#### Literate Programming by D. Knuth

Literate programming is an approach to programming introduced by Donald Knuth in which a program is given as an explanation of the program logic in a natural language, such as English, interspersed with snippets of macros and traditional source code, from which a compilable source code can be generated

#### [D. Knuth](https://www.brainyquote.com/quotes/donald_knuth_181634)
Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

---

#### Reproducible Research - Quote from a [Coursera Course](https://www.coursera.org/learn/reproducible-research)

Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them.  The need for reproducibility is increasing dramatically as data analyses become more complex, involving larger datasets and more sophisticated computations. Reproducibility allows for people to focus on the actual content of a data analysis, rather than on superficial details reported in a written summary. In addition, reproducibility makes an analysis more useful to others because the data and code that actually conducted the analysis are available. 

---

#### R Markdown workflow, [R for Data Science](https://r4ds.had.co.nz/r-markdown-workflow.html#r-markdown-workflow)

R Markdown is also important because it so tightly integrates prose and code. This makes it a great analysis notebook because it lets you develop code and record your thoughts. It:

* Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!

* Supports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.

* Helps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share why you did it with your colleagues or lab mates.

---

#### Records of EDA and Communication

1. Memo on a scratch paper: R Scripts
2. Record on a notebook: R Notebook (an R Markdown format)
3. Short paper or a digital communication: R Notebook
3. Paper or a report: R Markdown (html, pdf, MS Word, etc.)
4. Presentation (html, pdf, MS Powerpoint, etc.)
5. Publication of a Book
  * [BOOKDOWN: Write HTML, PDF, ePub, and Kindle books with R Markdown](https://bookdown.org). Free online document is provided in [pdf as well](https://bookdown.org/yihui/rmarkdown/rmarkdown.pdf)
  - [Arxive Page](https://bookdown.org/home/archive/) 

---

### Let's Get Started

1. Start R Studio - _Update R Studio if old_
2. Create a Project
3. Tool > Install Packages `rmarkdown`
    * Or on Console: `install.packages("rmarkdown")`
4. Tool > Install Packages `tinytex` (for pdf generation)
    * Alternatively, `install.packages('tinytex')`
    * If TeX is not installed: `tinytex::install_tinytex()`  \# install TinyTeX
      - If you are not sure, please check on `Terminal` in the left below pane: 
        + which latex, which mktexlsr
5. Let's try!  
    a. File > New File > R Notebook
    b. Save with a file name, say, test-notebook
    c. Preview by [Preview] button
    d. Run Code Chunk `plot(cars)` and then Preview again.
    e. Knit PDF, Word (and HTML)

---

### Templates

#### RNotebook_Template

Template to submit your assignment of this course: [`RNotebook_Template.nb.html`](https://icu-hsuzuki.github.io/da4r2022_note/RNotebook_Template.nb.html)

```
title: "Title of R Notebook"
author: "ID and Your Name"
date: "`r Sys.Date()`" 
output:
  html_notebook: null
```


##### YAML

* Change the title
* Write ID and your name
* Date is auto-generated and inserted. If you wish, you can replace "`r Sys.Date()`" by your favorite date style.

---

##### Code Chunk

* When you execute or run a code within the notebook, the results appear beneath the code.
* Try executing this chunk by clicking the Run button, a triangle pointing right, within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter (Win) or Cmd+Shift+Enter (Mac).
* Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Option+I (Win) or Cmd+Option+I (Mac).
* When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K (Win) or Cmd+Shift+K (Mac) to preview the HTML file).
* The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, _Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed_.

---

#### Testing R Markdown Formats

Various Output Formats: [`test-rmarkdown.nb.html`](https://icu-hsuzuki.github.io/da4r2022_note/test-rmarkdown.nb.html)

```
title: "Testing R Markdown Formats"
author: "DS-SL"
date: "`r Sys.Date()`"
output:
  html_notebook:
    number_sections: yes
  pdf_document: 
    number_sections: yes
  html_document:
    df_print: paged
    number_sections: yes
  word_document: 
    number_sections: yes
  powerpoint_presentation: default
  ioslides_presentation:
    widescreen: yes
    smaller: yes
  slidy_presentation: default
  beamer_presentation: default
```

---

#### Comments on Presentation Formats and Options

* For slides, a new slide starts at \#\#, the second-level heading.
* `---` is page break for presentation formats.
* For Word and Powerpoint, you can add your template. See the documents in References
  - Use R Markdown to create a Word document [similar for PowerPoint]
  - Save the rendered Word file as: `ref-doc-style.docx`
  - Edit the styles of the file `ref-doc-style.docx`
  - Add `ref-doc-style.docx` as reference_doc in YAML with indention as below

```
  word_document: 
    number_sections: yes
    reference_doc: ref-doc-style.docx
  powerpoint_presentation: 
    reference_doc: ref-ppt-style.pptx
```

* You can use `Output Options` at the bottom of the gear icon next to Preview/knit button.

---

### Markdown Language -- or use WYSIWYG editor

* Headers: \#, \#\#, \#\#\#, \#\#\#\#
* Lists: 1. 2. \ldots, * 
* Links: [linked phrase](http://example.com)
* Images: `![alt text](figures/filename.jpg)`
* Block quotes" > (block) 
* \LaTeX\  equations: e.g. `$\frac{a}{b}$` for $\frac{a}{b}$
* Horizontal rules: Three or more asterisks or dashes (*** or  - - - )
* Tables
* Footnotes
* Bibliographies and Citations
* Slide breaks
* _Italicized text_ by `_italic_`, **Bold text** by `**bold**`
* Superscripts, Subscripts, Strikethrough text

---

#### Visual R Markdown

>R Studio introduced Visual Editor towards the end of 2021. It seems to be stable but it is not perfect to go back and forth from the original editor using tags. I always use the original editor and I am confident on all the functions of it but I do not have much experience on Visual Editor. [My Note in QALL401 2021]
  
  * https://rstudio.github.io/visual-markdown-editing/

---

### References

* Posit Primers: [Report Reproducibly](https://rmarkdown.rstudio.com/lesson-1.html?_ga=2.60708591.317621277.1671142614-2004472742.1671142614)
* Markdown Quick Reference: Top Menu Bar \> Help \> Markdown Quick Reference
* Cheat Sheet (Top Menu Bar: Help \> Cheat Sheets): RMarkdown Cheat Sheet, RMarkdown Reference Guide
* Books:
  - In Textbook: [R for Data Science: Communicate](https://r4ds.had.co.nz/communicate-intro.html#communicate-intro)
  - [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/) by Yihui Xie, J. J. Allaire, Garrett Grolemund
  - [R Markdown Cookbook](https://bookdown.org/yihui/rmarkdown-cookbook/) by Yihui Xie, Christophe Dervieux, Emily Riederer
* Markdown: R Markdown is based on the Markdown language of Pandoc
  - [Pandoc's Markdown](https://pandoc.org/MANUAL.html#pandocs-markdown): Detailed Information
  - [Markdown Tutorials](https://www.markdowntutorial.com): Interactive Practicum
  - [DARING FIREBALL: Markdown](https://daringfireball.net/projects/markdown/) (detailed explanation and editor as Dingus)
* Post error messages to a web search engine.



## Data Transforamtion with `dplyr`

### `dplyr` [Overview](https://dplyr.tidyverse.org)

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

* `select()` picks variables based on their names.
* `filter()` picks cases based on their values.
* `mutate()` adds new variables that are functions of existing variables
* `summarise()` reduces multiple values down to a single summary.
* `arrange()` changes the ordering of the rows.
* `group_by()` takes an existing tbl and converts it into a grouped tbl.

You can learn more about them in vignette("dplyr"). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").

If you are new to dplyr, the best place to start is [the data transformation chapter in R for data science](http://r4ds.had.co.nz/transform.html).

---

### [`select`](https://dplyr.tidyverse.org/reference/select.html): Subset columns using their names and types

Helper Function	| Use	| Example
---|-------|--------
-	| Columns except	| select(babynames, -prop)
:	| Columns between (inclusive)	| select(babynames, year:n)
contains() |	Columns that contains a string |	select(babynames, contains("n"))
ends_with()	| Columns that ends with a string	| select(babynames, ends_with("n"))
matches()	| Columns that matches a regex |	select(babynames, matches("n"))
num_range()	| Columns with a numerical suffix in the range | Not applicable with babynames
one_of() |	Columns whose name appear in the given set |	select(babynames, one_of(c("sex", "gender")))
starts_with()	| Columns that starts with a string	| select(babynames, starts_with("n"))

---

### [`filter`](https://dplyr.tidyverse.org/reference/filter.html): Subset rows using column values

Logical operator	| tests	| Example
--|-----|---
>	| Is x greater than y? |	x > y
>=	| Is x greater than or equal to y? |	x >= y
<	| Is x less than y?	| x < y
<=	| Is x less than or equal to y? | 	x <= y
==	| Is x equal to y? |	x == y
!=	| Is x not equal to y? |	x != y
is.na()	| Is x an NA?	| is.na(x)
!is.na() |	Is x not an NA? |	!is.na(x)

---

### [`arrange`](https://dplyr.tidyverse.org/reference/arrange.html) and `Pipe %>%`

* `arrange()` orders the rows of a data frame by the values of selected columns.

Unlike other `dplyr` verbs, `arrange()` largely ignores grouping; you need to explicitly mention grouping variables (`or use .by_group = TRUE) in order to group by them, and functions of variables are evaluated once per data frame, not once per group.

* [`pipes`](https://r4ds.had.co.nz/pipes.html) in R for Data Science.

---

### [`mutate`](https://dplyr.tidyverse.org/reference/mutate.html) 

* Create, modify, and delete columns

* Useful mutate functions

  - +, -, log(), etc., for their usual mathematical meanings

  - lead(), lag()

  - dense_rank(), min_rank(), percent_rank(), row_number(), cume_dist(), ntile()

  - cumsum(), cummean(), cummin(), cummax(), cumany(), cumall()

  - na_if(), coalesce()### `group_by()` and `summarise()`

---

### [`group_by`](https://dplyr.tidyverse.org/reference/group_by.html)

---

### [`summarise` or `summarize`](https://dplyr.tidyverse.org/reference/summarise.html)

#### Summary functions

So far our summarise() examples have relied on sum(), max(), and mean(). But you can use any function in summarise() so long as it meets one criteria: the function must take a vector of values as input and return a single value as output. Functions that do this are known as summary functions and they are common in the field of descriptive statistics. Some of the most useful summary functions include:

1. Measures of location - mean(x), median(x), quantile(x, 0.25), min(x), and max(x)
2. Measures of spread - sd(x), var(x), IQR(x), and mad(x)
3. Measures of position - first(x), nth(x, 2), and last(x)
4. Counts - n_distinct(x) and n(), which takes no arguments, and returns the size of the current group or data frame.
5. Counts and proportions of logical values - sum(!is.na(x)), which counts the number of TRUEs returned by a logical test; mean(y == 0), which returns the proportion of TRUEs returned by a logical test.


  - if_else(), recode(), case_when()

---

### Learn `dplyr` by Examples

#### Data `iris`

```{r}
iris
```


---

```{r}
summary(iris)
```

---

#### `select` 1 - columns 1, 2, 5

```{r}
select(iris, c(1,2,5))
```

---

#### `select` 2 - except Species

```{r}
select(iris, -Species)
```

---

#### `select` 3 - change column names

```{r}
select(iris, sl = Sepal.Length, sw = Sepal.Width, sp = Species)
```

---

#### `filter` - by names

```{r}
filter(iris, Species == "virginica")
```


---

#### `arrange`  - ascending and descending order

```{r}
arrange(iris, Sepal.Length, desc(Sepal.Width))
```

---

#### `mutate` - rank

```{r}
iris %>% mutate(sl_rank = min_rank(Sepal.Length)) %>% arrange(sl_rank)
```

---

#### `group_by` and `summarize`

```{r}
iris %>% 
  group_by(Species) %>% 
  summarize(sl = mean(Sepal.Length), sw = mean(Sepal.Width), 
  pl = mean(Petal.Length), pw = mean(Petal.Width))
```

* mean: `mean()` or `mean(x, na.rm = TRUE)` - arithmetic mean (average)
* median: `median()` or `median(x, na.rm = TRUE)` - mid value

---

For more examples see 

[dplr_iris](https://icu-hsuzuki.github.io/da4r2022_note/dplyr-iris.nb.html)


### References of `dplyr`

* Textbook: [R for Data Science, Part II Explore](https://r4ds.had.co.nz/wrangle-intro.html#wrangle-intro)

::: {.block}
#### RStudio Primers: See References in Moodle at the bottom

1. The Basics -- [r4ds: Explore, I](https://r4ds.had.co.nz/explore-intro.html#explore-intro)
  - [Visualization Basics](https://rstudio.cloud/learn/primers/1.1)
  - [Programming Basics](https://rstudio.cloud/learn/primers/1.2)
2. **Work with Data** -- [r4ds: Wrangle, I](https://r4ds.had.co.nz/wrangle-intro.html#wrangle-intro)
  - **Working with Tibbles**
  - **Isolating Data with dplyr**
  - **Deriving Information with dplyr**
3. Visualize Data -- [r4ds: Explore, II](https://r4ds.had.co.nz/explore-intro.html#explore-intro)
4. Tidy Your Data -- [r4ds: Wrangle, II](https://r4ds.had.co.nz/wrangle-intro.html#wrangle-intro)
5. Iterate -- [r4ds: Program](https://r4ds.had.co.nz/program-intro.html#program-intro)
6. Write Functions -- [r4ds: Program](https://r4ds.had.co.nz/program-intro.html#program-intro)
::: 

---

### Learn `dplyr` by Examples II - `gapminder`



#### `ggplot2` [Overview](https://ggplot2.tidyverse.org)

`ggplot2` is a system for declaratively creating graphics, based on [The Grammar of Graphics](https://amzn.to/2ef1eWp). You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

**Examples**
```
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
```
```
ggplot(data = mpg) + 
  geom_boxplot(mapping = aes(x = class, y = hwy))
```

**Template**
```
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```

---

#### Gapminder and R Package `gapminder`

> Gapminder was founded by Ola Rosling, Anna Rosling Rönnlund, and Hans Rosling

-   Gapminder: <https://www.gapminder.org>

    -   Test on Top: You are probably wrong about - upgrade your worldview
    -   Bubble Chart: <https://www.gapminder.org/tools/#$chart-type=bubbles&url=v1>
    -   Dallar Street: <https://www.gapminder.org/tools/#$chart-type=bubbles&url=v1>
    -   Data: <https://www.gapminder.org/data/>

-   R Package gapminder by Jennifer Bryan

    -   Package site: <https://CRAN.R-project.org/package=gapminder>
    -   Site: <https://github.com/jennybc/gapminder>
    -   Documents: <https://www.rdocumentation.org/packages/gapminder/versions/0.3.0>

-   Package Help `?gapminder` or `gapminder` in the search window of Help

    -   The main data frame gapminder has 1704 rows and 6 variables:
        -   country: factor with 142 levels
        -   continent: factor with 5 levels
        -   year: ranges from 1952 to 2007 in increments of 5 years
        -   lifeExp: life expectancy at birth, in years
        -   pop: population
        -   gdpPercap: GDP per capita (US\$, inflation-adjusted)

---

```{r packages, message=FALSE}
library(tidyverse)
library(gapminder)
library(WDI)
```

---

#### R Package `gapminder` data

```{r}
df <- gapminder
df
```

---

```{r}
glimpse(df)
```

---

```{r}
summary(df)
```

---

#### Tidyverse::ggplot

##### First Try - with failures

```{r}
ggplot(df, aes(x = year, y = lifeExp)) + geom_point()
```

---

```{r}
ggplot(df, aes(x = year, y = lifeExp)) + geom_line()
```

---

```{r}
ggplot(df, aes(x = year, y = lifeExp)) + geom_boxplot()
```

---

```{r}
typeof(pull(df, year)) # same as typeof(df$year)
```

---

```{r}
ggplot(df, aes(y = lifeExp, group = year)) + geom_boxplot()
```

---

##### Box Plot

```{r}
ggplot(df, aes(x = as_factor(year), y = lifeExp)) + geom_boxplot()
```

---

#### Applications of `dplyr`

##### `filter`

```{r}
df %>% filter(country == "Afghanistan") %>%
  ggplot(aes(x = year, y = lifeExp)) + geom_line()
```

---

```{r}
df %>% filter(country %in% c("Afghanistan", "Japan")) %>%
  ggplot(aes(x = year, y = lifeExp, color = country)) + geom_line()
```

---

```{r}
df %>% distinct(country) %>% pull()
```

---

```{r}
df %>% filter(country %in% c("Brazil", "Russia", "India", "China")) %>%
  ggplot(aes(x = year, y = lifeExp, color = country)) + geom_line()
```

Russian data is missing.

---

### Exercises

1.  Change `lifeExp` to `pop` and `gdpPercap` and do the same.
2.  Choose ASEAN countries and do the similar investigations.

-   Brunei, Cambodia, Indonesia, Laos, Malaysia, Myanmar, Philippines, Singapore.

3.  Choose several countries by yourself and do the similar investigations.

---

### `group_by` and `summarize`

Let us use the variable `continent` and summarize the data.

```{r}
df_lifeExp <- df %>% group_by(continent, year) %>% 
  summarize(mean_lifeExp = mean(lifeExp), median_lifeExp = median(lifeExp), max_lifeExp = max(lifeExp), min_lifeExp = min(lifeExp), .groups = "keep")
```

---

```{r}
df_lifeExp
```

---

```{r}
df %>% filter(year %in% c(1952, 1987, 2007)) %>%
  ggplot(aes(x=as_factor(year), y = lifeExp, fill = continent)) +
  geom_boxplot()
```

---

```{r}
df_lifeExp %>% ggplot(aes(x = year, y = mean_lifeExp, color = continent)) +
  geom_line()
```

---

```{r}
df_lifeExp %>% ggplot(aes(x = year, y = mean_lifeExp, color = continent, linetype = continent)) +
  geom_line()
```

---

```{r}
df_lifeExp %>% ggplot() +
  geom_line(aes(x = year, y = mean_lifeExp, color = continent)) + 
  geom_line(aes(x = year, y = median_lifeExp, linetype = continent))
```


## The Week Three Assignment (in Moodle)

**R Markdown and `dplyr`**

* Create an R Notebook of a Data Analysis containing the following and submit the rendered HTML file (eg. `a2_123456.nb.html`)
  1. create an R Notebook using the R Notebook Template in Moodle,  save as `a2_123456.Rmd`, 
  2. write your name and ID and the contents, 
  3. run each code block, 
  4. preview to create `a2_123456.nb.html`,
  5. submit  `a2_123456.nb.html` to Moodle.

1. Pick data from the built-in datasets besides `cars`. (`library(help = "datasets")` or go to the site [The R Datasets Package](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html))

    - Information of the data: Name, Description, Usage, Format, Source, References (Hint: ?cars)
    - Use `head()`, `str()`, ..., and create at least one chart using `ggplot2` - Code Chunk.
      + Don't forget to add `library(tidyverse)` in the first code chunk.
    - An observation of the chart - in your own words.

---

2. Load `gapminder` by `library(gapminder)`.

    - Choose `pop` or `gdpPercap`, or both, one country in the data, a group of countries in the data.
    - Create charts using ggplot2 with geom_line and the variables and countries chosen in 1. (See examples of the charts for `lifeExp`.)
    - Study the data as you like.
    - Observations and difficulties encountered.

**Due:** 2023-01-09 23:59:00. Submit your R Notebook file in Moodle (The Second Assignment). Due on Monday!

---

### Original Data? WDI?

```{r}
gapminder
```

---

#### WDI

* SP.DYN.LE00.IN: Life expectancy at birth, total (years)
* NY.GDP.PCAP.KD: GDP per capita (constant 2015 US$)
* SP.POP.TOTL: Population, total

```{r cash=TRUE}
df_wdi <- WDI(
  country = "all", 
  indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD")
)
```

---

```{r}
df_wdi
```

---

```{r cash=TRUE}
df_wdi_extra <- WDI(
  country = "all", 
  indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD"), 
  extra = TRUE
)
```

---

```{r}
df_wdi_extra
```


# Exploratory Data Analysis III  

## Importing Public Data, WDI

### Reviews and Previews

```{r message=FALSE}
library(tidyverse)
library(gapminder)
library(maps)
library(WDI)
library(readxl)
library(ggrepel)
```

* We have used `tidyverse` and `gapminder` already.
* If you have not installed `WDI`, install it.
* We will not use `ggrepel` but if you want to use it, install it.
* `maps` and `readxl` are bundled in `tidyverse` but need to be attached by `library`.

---

#### Gapminder Package Data

```{r}
df <- gapminder
df
```

---

#### `gdpPercap` of ASEAN countries

```{r eval=FALSE}
asean <- c("Brunei", "Cambodia", "Laos", "Myanmar", 
           "Philippines", "Indonesia", "Malaysia", "Singapore")
df %>% filter(country %in% asean) %>%
  ggplot(aes(x = year, y = gdpPercap, col = country)) + geom_line()
```

---

```{r echo=FALSE}
asean <- c("Cambodia", "Myanmar", 
           "Philippines", "Indonesia", "Malaysia", "Singapore")
df %>% filter(country %in% asean) %>%
  ggplot(aes(x = year, y = gdpPercap, col = country)) + geom_line()
```


---

```{r echo=TRUE}
df %>% filter(country %in% asean) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, col = country)) + geom_point()
```

---

```{r echo=TRUE}
df %>% filter(country %in% asean) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, col = country)) + 
  geom_point() + coord_trans(x = "log10", y = "identity")
```

$\log_{10}{100}$ = `r log10(100)`, $\log_{10}{1000}$ = `r log10(1000)`, $\log_{10}{10000}$ = `r log10(10000)`

<!-- $10^{2.5}$ = `r sqrt(10)^{5}`, $10^{3}$ = `r 10^{3}`, $10^{3.5}$ = `r sqrt(10)^7`,  -->

<!-- $10^{4}$ = `r 10^4`, $10^{4.5}$ = `r sqrt(10)^9`. -->

---

```{r gapminder-combined, eval=FALSE}
library(ggrepel)
df2007 <- df %>% filter(country %in% asean, year == 2007)
df %>% filter(country %in% asean) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, col = country))+ 
  geom_line() + geom_label_repel(data = df2007, aes(label = country)) + geom_point()  +
  coord_trans(x = "log10", y = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1), legend.position = "none") +
  labs(title = "Life Expectancy vs GDP Per Capita of ASEAN Countries",
       subtitle = "Data: gapminder package", x = "GDP per Capita", y = "Life Expectancy")
```

---


```{r echo=FALSE}
library(ggrepel)
df2007 <- df %>% filter(country %in% asean, year == 2007)
df %>% filter(country %in% asean) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, col = country))+ 
  geom_line() + geom_label_repel(data = df2007, aes(label = country)) + geom_point()  +
  coord_trans(x = "log10", y = "identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1), legend.position = "none") +
  labs(title = "Life Expectancy vs GDP Per Capita of ASEAN Countries",
       subtitle = "Data: gapminder package", x = "GDP per Capita", y = "Life Expectancy")
```

---


```{r echo=FALSE}
library(tidyverse)
library(maps)
world_map <- map_data("world")
df %>%
  ggplot(aes(map_id = country)) + 
  geom_map(aes(fill = gdpPercap), map = world_map) + expand_limits(x = world_map$long, y = world_map$lat) +
  labs(title = "Gapminder Package Data", subtitle="World Map of GDP per Capita Data")
```

---

#### World Bank: World Development Indicators (WDI)

* SP.DYN.LE00.IN: Life expectancy at birth, total (years)
* NY.GDP.PCAP.KD: GDP per capita (constant 2015 US$)
* SP.POP.TOTL: Population, total

```{r cash=TRUE, eval=FALSE}
df_wdi <- WDI(
  country = "all", 
  indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD")
)
```

```{r echo=FALSE, message=FALSE, eval=FALSE}
write_csv(df_wdi, "data/df_wdi.csv")
```

```{r echo=FALSE, message=FALSE}
df_wdi <- read_csv("data/df_wdi.csv")
```


---

```{r}
df_wdi
```

---

```{r cash=TRUE, eval=FALSE}
df_wdi_extra <- WDI(
  country = "all", 
  indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD"), 
  extra = TRUE
)
```

```{r echo=FALSE, message=FALSE, eval=FALSE}
write_csv(df_wdi_extra, "data/df_wdi_extra.csv")
```

```{r echo=FALSE, message=FALSE}
df_wdi_extra <- read_csv("data/df_wdi_extra.csv")
```


---

```{r}
df_wdi_extra
```


---

### Exploratory Data Analysis

#### What is EDA (Posit Primers: [Visualise Data](https://posit.cloud/learn/primers/3.1))

1. EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

2. Generate questions about your data

3. Search for answers by visualising, transforming, and/or modeling your data

Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.

---

### Open and Public Data, World Bank

#### [Open Government Data Toolkit](http://opendatatoolkit.worldbank.org): [Open Data Defined](http://opendatatoolkit.worldbank.org/en/essentials.html)

The term **Open Data** has a very precise meaning. Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.

1. The data must be _legally open_, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
2. The data must be _technically open_, which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions. To make Open Data easier to find, most organizations create and manage Open Data catalogs.

---

### World Bank: WDI - World Development Indicaters

* World Bank: https://www.worldbank.org
* [Who we are](https://www.worldbank.org/en/who-we-are):
  - To end extreme poverty: By reducing the share of the global population that lives in extreme poverty to 3 percent by 2030.
  - To promote shared prosperity: By increasing the incomes of the poorest 40 percent of people in every country. 
* World Bank Open Data: https://data.worldbank.org
  - Data Bank, World Development Indicators, etc.
* [World Development Indicators (WDI)](https://datatopics.worldbank.org/world-development-indicators/) : the World Bank’s premier compilation of cross-country comparable data on development; 1400 time series indicators
  - Themes: Poverty and Inequality, People, Environment, Economy, States and Markets, Global Links
  - Open Data & DataBank: Explore data, Query database
  - Bulk Download: Excel, CSV
  - API Documentation
  
---

### R Package [WDI](https://CRAN.R-project.org/package=WDI)

* [WDI](https://CRAN.R-project.org/package=WDI): World Development Indicators and Other World Bank Data
* Search and download data from over 40 databases hosted by the World Bank, including the World Development Indicators ('WDI'), International Debt Statistics, Doing Business, Human Capital Index, and Sub-national Poverty indicators.
* Version: 2.7.4
* Materials:	[README](https://cran.r-project.org/web/packages/WDI/readme/README.html)   - _usage_
  - [NEWS](https://cran.r-project.org/web/packages/WDI/news/news.html) - _version history_
* Published: 2021-04-06
* README: https://cran.r-project.org/web/packages/WDI/readme/README.html
* Reference manual:	[WDI.pdf](https://cran.r-project.org/web/packages/WDI/WDI.pdf)

---

### Function WDI

* **Usage**

```
WDI(country = "all",
    indicator = "NY.GDP.PCAP.KD",
    start = 1960,
    end = 2020,
    extra = FALSE,
    cache = NULL)
```

* **Arguments** See Help!
  - country: Vector of countries (ISO-2 character codes, e.g. "BR", "US", "CA", or "all") 
  - indicator: If you supply a named vector, the indicators will be automatically renamed: `c('women_private_sector' = 'BI.PWK.PRVS.FE.ZS')`

---

### Function WDIsearch

```{r}
library(WDI)
```
```{r}
WDIsearch(string = "NY.GDP.PCAP.KD", 
          field = "indicator", cache = NULL)
```

---

```{r}
WDIsearch(string = "population", 
          field = "name", short=FALSE, cache = NULL)
```

---

```
WDIsearch(string = "NY.GDP.PCAP.KD", 
  field = "indicator", short = FALSE, cache = NULL)
```
```
WDIsearch(string = "gdp", 
  field = "name", short = TRUE, cache = NULL) 
```

---

### Bulk Downloads at WDI site

WDIbulk downloads the zip file of Bulk Downloads in [WDI site](https://datatopics.worldbank.org/world-development-indicators/) , it is a list containing 6 data frames: Data, Country, Series, Country-Series, Series-Time, FootNote.

---

### WDIcache

Download an updated list of available WDI indicators from the World Bank website. Returns a list for use in the WDIsearch function.

```{r widcache, cash=TRUE, eval=FALSE}
wdi_cache <- WDIcache()
```

Downloading all series information from the World Bank website can take time. The WDI package ships with a local data object with information on all the series available on 2012-06-18. You can update this database by retrieving a new list using `WDIcache`, and then feeding the resulting object to `WDIsearch` via the cache argument.

```{r echo=FALSE, message=FALSE, eval=FALSE}
write_rds(wdi_cache, "data/wdi_cache.RData", refhook = NULL)
```

```{r echo=FALSE, message=FALSE, eval=FALSE}
wdi_cache <- read_rds("data/wdi_cache.RData", refhook = NULL)
```



---

```{r eval=FALSE}
wdi_cache
```


---

### WDI_data

List of 2 data frames

The first character matrix includes a full list of WDI series. This list is updated semi-regularly. Users can refresh the list manually using the 'WDIcache()' function and search in the updated list using the 'cache' argument.


<!-- ```{r} -->
<!-- glimpse(WDI_data) -->
<!-- ``` -->

<!-- --- -->

<!-- ```{r} -->
<!-- WDI_data$series -->
<!-- ``` -->

<!-- --- -->

<!-- ```{r} -->
<!-- WDI_data$country -->
<!-- ``` -->

```{r}
WDI_data$country  %>% filter(country == "Japan")
```

---

```{r}
WDIsearch(string = "gdp", 
  field = "name", short = FALSE, cache = NULL) #cache = wdi_cache
```

---

### World Development Indicators - Summary

Find indicators:

1. `WDIsearch(string = "gdp", field = "name", short = FALSE, cache = NULL)`
  - `WDIsearch(string = "gdp", field = "name", short = FALSE, cache = wdi_cache)`
  - `WDIsearch(string = "NY.GDP.PCAP.KD", field = "indicator", short = FALSE, cache = NULL)`
2. [WDI](https://datatopics.worldbank.org/world-development-indicators/): Data Themes
3. Browse by Indicators: https://data.worldbank.org/indicator
   - Featured Indicators or All Indicators
   - Obtain the indicator from the detail or the URL

---

#### Example: CO2 emissions (metric tons per capita)

* ID: EN.ATM.CO2E.PC
* URL: https://data.worldbank.org/indicator/EN.ATM.CO2E.PC

```{r}
WDIsearch(string = "EN.ATM.CO2E.PC", field = "indicator", 
          short = FALSE, cache = NULL) #cache = wdi_cache
```

```{r}
WDIsearch(string = "EN.ATM.CO2E.PC", field = "indicator", 
          short = FALSE, cache = NULL) %>% pull(description) #cache = wdi_cache
```

* Source: Climate Watch. 2020. GHG Emissions. Washington, DC: World Resources Institute. Available at: climatewatchdata.org/ghg-emissions. See SP.POP.TOTL for the denominator's source.




---

```{r cash=TRUE, eval=FALSE}
co2pcap <- WDI(country = "all", indicator = "EN.ATM.CO2E.PC", start = 1960, end = NULL, extra = TRUE, cache = NULL) #cache = wdi_cache
```

```{r echo=FALSE, message=FALSE, eval=FALSE}
write_csv(co2pcap, "data/co2pcap.csv")
```

```{r echo=FALSE, message=FALSE, eval=FALSE}
co2pcap <- read_csv("data/co2pcap.csv")
```

```{r}
co2pcap
```


---

```{r}
write_csv(co2pcap, "data/co2pcap.csv")
```


---

```{r}
co2pcap %>% filter(country %in% c("World", "Japan", "United States", "China")) %>%
  ggplot(aes(x = year, y = EN.ATM.CO2E.PC, color = country)) + 
  geom_line()
```

---

```{r}
co2pcap %>% filter(!is.na(EN.ATM.CO2E.PC)) %>% pull(year) %>% summary()
```




---

```{r}
co2pcap %>% 
  filter(country %in% c("World", "Japan", "United States", "China"), year %in% 1990:2019) %>%
  ggplot(aes(x = year, y = EN.ATM.CO2E.PC, color = country)) + 
  geom_line()
```

---

```{r}
co2pcap %>% 
  filter(income != "Aggregates", year == 2019) %>%
  ggplot(aes(x = income, y = EN.ATM.CO2E.PC, fill = income)) + 
  geom_boxplot()
```

---

```{r}
co2pcap %>% 
  filter(income != "Aggregates", year == 2019, !is.na(EN.ATM.CO2E.PC)) %>%
  ggplot(aes(x = income, y = EN.ATM.CO2E.PC, fill = income)) + 
  geom_boxplot()
```

* What is `boxplot`: https://vimeo.com/222358034

---

```{r}
co2pcap %>% 
  filter(income != "Aggregates", year == 2019, !is.na(EN.ATM.CO2E.PC)) %>%
  group_by(income) %>%
  summarize(min = min(EN.ATM.CO2E.PC), med = median(EN.ATM.CO2E.PC), max = max(EN.ATM.CO2E.PC), IQR = IQR(EN.ATM.CO2E.PC), n = n())
```

---
```{r}
co2pcap %>% 
  filter(income != "Aggregates", year == 2019, !is.na(EN.ATM.CO2E.PC)) %>%
  filter(!income %in% c("High income", "Low income", "Lower middle income", "Upper middle income"))
```

```{r}
co2pcap %>% 
  filter(income != "Aggregates", year == 2019) %>%
  filter(income == "Not classified")
```

---

```{r echo=TRUE, eval=FALSE}
co2pcap %>% 
  filter(income != "Aggregates", year == 2019) %>%
  ggplot(aes(map_id = country)) + 
  geom_map(aes(fill = income), map = world_map) + expand_limits(x = world_map$long, y = world_map$lat) +
  labs(title = "Income Levels in 2019")
```
---

```{r echo=FALSE}
co2pcap %>% 
  filter(income != "Aggregates", year == 2019) %>%
  ggplot(aes(map_id = country)) + 
  geom_map(aes(fill = income), map = world_map) + expand_limits(x = world_map$long, y = world_map$lat) +
  labs(title = "Income Levels in 2019")
```

---

```{r}
co2pcap %>% distinct(country)
```

---

```{r}
world_map %>% distinct(region)
```

---

```{r}
world_map0 <- world_map %>% 
  mutate(region = case_when(region == "Macedonia" ~ "North Macedonia",
                            region == "Ivory Coast"  ~ "Cote d'Ivoire",
                            region == "Democratic Republic of the Congo"  ~ "Congo, Dem. Rep.",
                            region == "Republic of Congo" ~  "Congo, Rep.",
                            region == "UK" ~  "United Kingdom",
                            region == "USA" ~  "United States",
                            region == "Laos" ~  "Lao PDR",
                            region == "Slovakia" ~  "Slovak Republic",
                            region == "Saint Lucia" ~  "St. Lucia",
                            region == "Kyrgyzstan"  ~  "Kyrgyz Republic",
                            region == "Micronesia" ~ "Micronesia, Fed. Sts.",
                            region == "Swaziland"  ~ "Eswatini", 
                            region == "Virgin Islands"  ~ "Virgin Islands (U.S.)", 
                            region == "Russia" ~ "Russian Federation", 
                            region == "Egypt" ~ "Egypt, Arab Rep.",
                            region == "South Korea" ~ "Korea, Rep.",
                            region == "North Korea" ~ "Korea, Dem. People's Rep.",
                            region == "Iran" ~ "Iran, Islamic Rep.",
                            region == "Brunei" ~ "Brunei Darussalam",
                            region == "Venezuela" ~ "Venezuela, RB",
                            region == "Yemen" ~ "Yemen, Rep.",
                            region == "Bahamas" ~ "Bahamas, The",
                            region == "Syria" ~ "Syrian Arab Republic",
                            region == "Turkey" ~ "Turkiye",
                            region == "Cape Verde" ~ "Cabo Verde",
                            region == "Gambia" ~ "Gambia, The",
                            region == "Czech Republic" ~ "Czechia",
                            TRUE ~ region))
```

---

```{r}
write_csv(world_map0, "data/world_map0.csv")
```

```{r eval=FALSE}
map0_url <- "https://icu-hsuzuki.github.io/da4r2022_note/data/world_map0.csv"
world_map0 <- read_csv(map0_url)
```


---

```{r}
co2pcap %>% filter(income != "Aggregates", year == 2019) %>% 
  anti_join(world_map0, by = c("country"="region"))
```

---

```{r}
world_map0 %>% anti_join(co2pcap, by = c("region"="country")) %>% distinct(region) %>% arrange(region)
```

---



```{r}
world_map0 %>% left_join(iso3166, by = c("region" = "ISOname")) %>%
  filter(is.na(a2)) %>% distinct(region)
```


---


```{r echo=FALSE}
co2pcap %>% 
  filter(income != "Aggregates", year == 2019) %>%
  ggplot(aes(map_id = country)) + 
  geom_map(aes(fill = income), map = world_map0) + expand_limits(x = world_map$long, y = world_map$lat) +
  labs(title = "Income Levels in 2019")
```




## Data Visualization and `ggplot2'


### Learning Resouces

* Posit Primers:
  - [Visualize Data](https://posit.cloud/learn/primers/3): Learn how to use ggplot2 to make any type of plot with your data. Then learn the best ways to visualize patterns within values and relationships between variables.
* [r4ds: Data Visualization](https://r4ds.had.co.nz/data-visualisation.html#data-visualisation)

---

### Exploratory Data Analysis

#### What is EDA?

EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

1. Generate questions about your data

2. Search for answers by visualising, transforming, and/or modeling your data

3. Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.

---

#### Two useful questions

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

1. What type of variation occurs within my variables?
2. What type of covariation occurs between my variables?

The rest of this tutorial will look at these two questions. To make the discussion easier, let’s define some terms…

---

### Data Visualization

### `ggplot2` Basics

![visualization](data/visualization.png){width=75%}


---

### Example: World Inequility Report - WIR2022

* World Inequality Report: https://wir2022.wid.world/
* Executive Summary: https://wir2022.wid.world/executive-summary/
* Methodology: https://wir2022.wid.world/methodology/
* Data URL: https://wir2022.wid.world/www-site/uploads/2022/03/WIR2022TablesFigures-Summary.xlsx

```{r}
library(readxl)
```

```{r summary-data, cash = TRUE, eval = FALSE}
url_summary <- "https://wir2022.wid.world/www-site/uploads/2022/03/WIR2022TablesFigures-Summary.xlsx"
download.file(url = url_summary, destfile = "data/WIR2022s.xlsx") 
excel_sheets("data/WIR2022s.xlsx")
```

### F14: Global carbon inequality, 2019. Group contribution to world emissions (%)

Note that the sheet name of F14 has period at the end. 

```{r data-f14, cash = TRUE}
df_f14 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F14.")
df_f14
```

* `\n` for line break in the title.

---

#### Categorical vs Continuous Value

```{r}
df_f14 %>% 
  ggplot(aes(x = Group, y = Share)) +
  geom_col()
```

---

```{r}
df_f14 %>% 
  ggplot(aes(x = Group, y = Share)) +
  geom_col(width = 0.5, fill = scales::hue_pal()(1)[1]) + 
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 14. Global carbon inequality, \n2019 Group contribution to world emissions (%)", 
       x = "", y = "Share of world emissions (%)")
```

---

#### Memo

* `width = 0.5`: width of bars
* `fill = scales::hue_pal()(1)[1])`: hue scale
  - https://ggplot2.tidyverse.org/reference/scale_hue.html.
* `scale_y_continuous(labels = scales::percent_format(accuracy = 1))`: percent format
  - if accuracy = 0.1, we have 10.0% etc.
* `labs(title = "Figure 14. Global carbon inequality, \n2019 Group contribution to world emissions (%)",
   x = "", y = "Share of world emissions (%)")`
  - title = "": `\n` is for line feed
  - x, y: labels of x-axis and y-axis

---

### F12: Female share in global labor incomes, 1990-2020

```{r data-f12, cash = TRUE}
df_f12 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F12")
df_f12
```

---

```{r}
df_f12 %>% 
  select(year = "Data needs to be updated", value = ...2) %>%
  filter(!is.na(year)) %>%
  ggplot(aes(x = year, y = value)) +
  geom_col(width = 0.5, fill = scales::hue_pal()(2)[2])
```

---

```{r}
df_f12 %>% 
  select(year = "Data needs to be updated", value = ...2) %>%
  filter(!is.na(year)) %>%
  ggplot(aes(x = year, y = value)) +
  geom_col(width = 0.5, fill = scales::hue_pal()(2)[2]) +
  geom_hline(yintercept = 0.5, linetype = 2, colour = scales::hue_pal()(2)[1]) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 12. Female share in global labor incomes, 1990-2020", 
        x = "", y = "") +
  annotate("text", x = 1, y = 0.48, label = "Gender parity", size = 3) +
  annotate("text", x = 5.2, y = 0.47, label = stringr::str_wrap("Women make only 35% of global labor incomes, men make the remaining  65%.", width = 40), size = 3)
```


### F1: Global income and wealth inequality, 2021

```{r data-f1, cash = TRUE}
df_f1 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F1")
df_f1
```

---

```{r echo=FALSE}
df_f1_rev <- df_f1 %>% select(cat = ...1, 2:4) %>%
  pivot_longer(2:4, names_to = "group", values_to = "value")
df_f1_rev
```

```{r}
df_f1_rev %>%
  ggplot(aes(x = cat, y = value, fill = group)) +
  geom_col(position = "dodge")
```

---

### References of `ggplot2`

* Textbook: [R for Data Science, Data Visualization](https://r4ds.had.co.nz/data-visualisation.html#data-visualisation)

#### RStudio Primers: See References in Moodle at the bottom

**Visualize Data**

  - Exploratory Data Analysis
  - Bar Charts
  - Histograms
  - Boxplots and Counts
  - Scatterplots
  - Line Plots
  - Overplotting and Big Data
  - Customize Your Plots



## The Week Four Assignment (in Moodle)

**WDI and `ggplot2`**

* Create an R Notebook of a Data Analysis containing the following and submit the rendered HTML file (eg. `a3_123456.nb.html`  by replacing 123456 with your ID)
  1. create an R Notebook using the R Notebook Template in Moodle,  save as `a3_123456.Rmd`, 
  2. write your name and ID and the contents, 
  3. run each code block, 
  4. preview to create `a3_123456.nb.html`,
  5. submit  `a3_123456.nb.html` to Moodle.

1. Choose at least one indicator of WDI

    - Information of the data: Name, Indicator, Description, Source, etc.
    - Download the data with `WDI`
    - Explain why you chose the indicator
    - List questions you want to study

---

2. Explore the data using visualization using `ggplot2`

    - Use a histogram (geom_histogram), boxplot (geom_boxplot), a scatter plot (geom_point), a line plot (geom_line)
    - For at least one chart, add title, and labels of axis, and add an explanation of it

3. Observations and difficulties encountered.

**Due:** 2023-01-16 23:59:00. Submit your R Notebook file in Moodle (The Third Assignment). Due on Monday!

# Exploratory Data Analysis (EDA) IV  

## Tidy Data

### Reviews and Previews

### Example: World Inequility Report - WIR2022

* World Inequality Report: https://wir2022.wid.world/
* Executive Summary: https://wir2022.wid.world/executive-summary/
* Methodology: https://wir2022.wid.world/methodology/
* Data URL: https://wir2022.wid.world/www-site/uploads/2022/03/WIR2022TablesFigures-Summary.xlsx

```{r}
library(tidyverse)
library(readxl)
```

```{r summary-data2, cash = TRUE, eval = FALSE}
url_summary <- "https://wir2022.wid.world/www-site/uploads/2022/03/WIR2022TablesFigures-Summary.xlsx"
download.file(url = url_summary, destfile = "data/WIR2022s.xlsx") 
```

```{r}
excel_sheets("data/WIR2022s.xlsx")
```

---

### F1: Global income and wealth inequality, 2021

```{r data-f1-2, cash = TRUE, message = FALSE}
df_f1 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F1")
df_f1
```



```{r echo=FALSE}
df_f1_rev <- df_f1 %>% select(cat = ...1, 2:4) %>%
  pivot_longer(2:4, names_to = "group", values_to = "value")
df_f1_rev
```

---

```{r}
df_f1_rev %>%
  ggplot(aes(x = cat, y = value, fill = group)) +
  geom_col(position = "dodge")
```

---

### References of `tidyr`

* Textbook: [R for Data Science,Tidy Data](https://r4ds.had.co.nz/tidy-data.html#tidy-data)

#### RStudio Primers: See References in Moodle at the bottom

**Tidy Your Data**

  - Reshape Data
  - Separate and Unite Columns
  - Join Data Sets
  
---

### Variables, values, and observations: Definitions

* A **variable** is a quantity, quality, or property that you can measure.
* A **value** is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
* An **observation** or **case** is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. I’ll sometimes refer to an observation as a case or data point.
* **Tabular data** is a table of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own cell, each variable in its own column, and each observation in its own row.
* So far, all of the data that you’ve seen has been tidy. In real-life, most data isn’t tidy, so we’ll come back to these ideas again in Data Wrangling.

---

### Tidy Data

> “Data comes in many formats, but R prefers just one: tidy data.” — Garrett Grolemund

Data can come in a variety of formats, but one format is easier to use in R than the others. This format is known as tidy data. A data set is tidy if:

1. Each variable is in its own column
2. Each observation is in its own row
3. Each value is in its own cell (this follows from #1 and #2)

> “Tidy data sets are all alike; but every messy data set is messy in its own way.” — Hadley Wickham

> “all happy families are all alike; each unhappy family is unhappy in its own way” - Tolstoy's Anna Karenina

---

### `tidyr` Basics

```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics("data/tidy-1.png")
```

1. Each variable is in its own column
2. Each observation is in its own row

---

### Pivot data from wide to long: [`pivot_longer()`](https://tidyr.tidyverse.org/reference/pivot_longer.html)

```
pivot_longer(data, cols = <columns to pivot into longer format>,
  names_to = <name of the new character column>, # e.g. "group", "category", "class"
  values_to = <name of the column the values of cells go to>) # e.g. "value", "n"
```

```{r}
df_f1
```

```{r}
(df_f1_rev <- df_f1 %>% pivot_longer(-1, names_to = "group", values_to = "value"))
```

---

```{r}
df_f1_rev %>% 
  ggplot(aes(x = ...1, y = value, fill = group)) +
  geom_col(position = "dodge")
```

---

```{r eval=FALSE}
df_f1_rev %>% filter(group != "Top 1%") %>%
  ggplot() +
  geom_col(aes(x = ...1, y = value, fill = group), position = "dodge") +
  geom_text(aes(x = ...1, y = value, group = group, 
            label = scales::label_percent(accuracy=1)(value)), 
            position = position_dodge(width = 0.9)) + 
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 1. Global income and wealth inequality, 2021",
       x = "", y = "Share of total income or wealth", fill = "")
```

---

```{r echo=FALSE}
df_f1_rev %>% filter(group != "Top 1%") %>%
  ggplot() +
  geom_col(aes(x = ...1, y = value, fill = group), position = "dodge") +
  geom_text(aes(x = ...1, y = value, group = group, 
            label = scales::label_percent(accuracy=1)(value)), 
            position = position_dodge(width = 0.9)) + 
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 1. Global income and wealth inequality, 2021",
       x = "", y = "Share of total income or wealth", fill = "")
```
**Interpretation**: The global bottom 50% captures 8.5% of total income measured at Purchasing Power Parity (PPP). The global bottom 50% owns 2% of wealth (at Purchasing Power Parity). The global top 10% owns 76% of total Household wealth and captures 52% of total income in 2021. Note that top wealth holders are not necessarily top income holders. Incomes are measured after the operation of pension and unemployment systems and before taxes and transfers.  
**Sources and series**: wir2022.wid.world/methodology.

---

### F2: The poorest half lags behind: Bottom 50%, middle 40% and top 10% income shares across the world in 2021

```{r}
df_f2 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F2")
df_f2
```

---

```{r}
df_f2 %>% pivot_longer(cols = 3:5, names_to = "group", values_to = "value")
```

---

```{r}
df_f2 %>% pivot_longer(cols = 3:5, names_to = "group", values_to = "value") %>%
  ggplot(aes(x = iso, y = value, fill = group)) +
  geom_col(position = "dodge")
```

---

### Pivot data from long to wide: 
[`pivot_wider()`](https://tidyr.tidyverse.org/reference/pivot_wider.html)
In Console: vignette("pivot") 

```
pivot_wider(data, 
  names_from = <name of the column (or columns) to get the name of the output column>,
  values_from = <name of the column to get the value of the output>) 
```


---

```{r echo = FALSE}
df_f2 %>% pivot_longer(cols = 3:5, names_to = "group", values_to = "value")
```

```
pivot_wider(data, names_from = group, values_from = value) 
```

---

### Practice: F4 and F13

F4 and F13 are similar. Please use `pivot_longer` to tidy the data and create charts.

* **References**: https://ds-sl.github.io/data-analysis/wir2022.nb.html

#### Done Last Week

* F12: Female share in global labor incomes, 1990-2020
* F14: Global carbon inequality, 2019. Group contribution to world emissions (%)

---

### F3: Top 10/Bottom 50 income gaps across the world, 2021


```{r}
df_f3 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F3")
df_f3
```

---

### F3: Top 10/Bottom 50 income gaps across the world, 2021 - Original

```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics("data/F3.png")
```

---

* To 10 / Bottom 50 ratio has 5 classes: 5-12, 12-13, 13-16, 16-19, 19-140

```{r}
df_f3$T10B50 %>% summary()
```

---

```{r}
df_f3 %>% ggplot() + geom_histogram(aes(T10B50))
```

---

```{r}
df_f3 %>% arrange(desc(T10B50))
```

---

```{r}
df_f3 %>% 
  mutate(`Top 10 Bottom 50 Ratio` = cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), 
                                        include.lowest = FALSE)) 
```

---

```{r}
world_map <- map_data("world")
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` = cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), 
                                        include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + 
  geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), map = world_map) + 
  expand_limits(x = world_map$long, y = world_map$lat)
```

---

```{r}
world_map_wir <- world_map
world_map_wir$region[
  world_map_wir$region=="Democratic Republic of the Congo"]<-"DR Congo"
world_map_wir$region[world_map_wir$region=="Republic of Congo"]<-"Congo"
world_map_wir$region[world_map_wir$region=="Ivory Coast"]<-"Cote dIvoire"
world_map_wir$region[world_map_wir$region=="Vietnam"]<-"Viet Nam"
world_map_wir$region[world_map_wir$region=="Russia"]<-"Russian Federation"
world_map_wir$region[world_map_wir$region=="South Korea"]<-"Korea"
world_map_wir$region[world_map_wir$region=="UK"]<-"United Kingdom"
world_map_wir$region[world_map_wir$region=="Brunei"]<-"Brunei Darussalam"
world_map_wir$region[world_map_wir$region=="Laos"]<-"Lao PDR"
world_map_wir$region[world_map_wir$region=="Cote dIvoire"]<-"Cote d'Ivoire"
world_map_wir$region[world_map_wir$region=="Cape Verde"]<- "Cabo Verde"
world_map_wir$region[world_map_wir$region=="Syria"]<- "Syrian Arab Republic"
world_map_wir$region[world_map_wir$region=="Trinidad"]<- "Trinidad and Tobago"
world_map_wir$region[world_map_wir$region=="Tobago"]<- "Trinidad and Tobago"
```
  
---

```{r}
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` = 
    cut(T10B50, breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + 
  geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), 
    map = world_map_wir) + 
    expand_limits(x = world_map_wir$long, y = world_map_wir$lat)
```



---

```{r}
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` = 
    cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), 
    map = world_map_wir) + expand_limits(x = world_map_wir$long, y = world_map_wir$lat) + 
  coord_map("orthographic", orientation = c(25, 60, 0))
```

---

```{r}
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` = 
  cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), 
    map = world_map_wir) + expand_limits(x = world_map_wir$long, y = world_map_wir$lat) + 
  coord_map("orthographic", orientation = c(15, -80, 0))
```

---


```{r}
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` = 
  cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), 
    map = world_map_wir) + 
  expand_limits(x = world_map_wir$long, y = world_map_wir$lat)
```

---

```{r eval = FALSE}
df_f3 %>% 
  mutate(`Top 10 Bottom 50 Ratio` = 
        cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + 
  geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), map = world_map_wir) + 
  expand_limits(x = world_map_wir$long, y = world_map_wir$lat)  + 
  labs(title = "Figure 3. Top 10/Bottom 50 income gaps across the world, 2021",
       x = "", y = "", fill = "Top 10/Bottom 50 ratio") +
  theme(legend.position="bottom", 
        axis.text.x=element_blank(), axis.ticks.x=element_blank(),
        axis.text.y=element_blank(), axis.ticks.y=element_blank()) + 
  scale_fill_brewer(palette='YlOrRd')
```

---

```{r echo = FALSE}
df_f3 %>% 
  mutate(`Top 10 Bottom 50 Ratio` = cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
  ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), map = world_map_wir) + expand_limits(x = world_map_wir$long, y = world_map_wir$lat)  + 
  labs(title = "Figure 3. Top 10/Bottom 50 income gaps across the world, 2021",
       x = "", y = "", fill = "Top 10/Bottom 50 ratio") +
  theme(legend.position="bottom", 
        axis.text.x=element_blank(), axis.ticks.x=element_blank(),
        axis.text.y=element_blank(), axis.ticks.y=element_blank()) + 
  scale_fill_brewer(palette='YlOrRd')
```

---

```{r}
df_f3 %>% anti_join(world_map_wir, by = c("Country" = "region"))
```

**Filtering joins**

* `anti_join(x,y, ...)`: return all rows from x without a match in y.
* `semi_join(x,y, ...)`: return all rows from x with a match in y.

Check `dplyr` cheat sheet, and Posit Primers Tidy Data.

---

### Remaining Charts

* F5: Global income inequality: T10/B50 ratio, 1820-2020 - fit curve
* F9: Average annual wealth growth rate, 1995-2021 - fit curve + alpha
* F7: Global income inequality, 1820-2020 - pivot + fit curve
* F10: The share of wealth owned by the global 0.1% and billionaires, 2021 - pivot + fit curve


* F6: Global income inequality: Between vs. Within country inequality (Theil index), 1820-2020 - pivot + area

* F11: Top 1% vs bottom 50% wealth shares in Western Europe and the US, 1910-2020 - pivot name_sep + fit curve
* F8: The rise of private versus the decline of public wealth in rich countries, 1970-2020 - rename + pivot + pivot + fit curve

* F15: Per capita emissions acriss the world, 2019 - add row names + dodge


---

### F5: Global income inequality: T10/B50 ratio, 1820-2020

```{r data-f5, cash = TRUE}
(df_f5 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F5"))
```

---

```{r}
df_f5 %>% ggplot(aes(x = y, y = t10b50)) + geom_line() + geom_smooth(span=0.25, se=FALSE)
```

---

### F9: Average annual wealth growth rate, 1995-2021 - fit curve + alpha

```{r data-f9, cash = TRUE}
df_f9 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F9"); df_f9
```

---

```{r}
df_f9 %>% 
  ggplot(aes(x = p, y = `Wealth growth 1995-2021`)) + geom_smooth(span = 0.30, se = FALSE)
```

---

### F7: Global income inequality, 1820-2020 - pivot + fit curve

```{r data-f7, cash = TRUE}
df_f7 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F7"); df_f7
```

---

```{r}
df_f7 %>% 
  pivot_longer(cols = 2:4, names_to = "type", values_to = "value") %>%
  ggplot(aes(x = y, y = value, color = type)) +
  stat_smooth(formula = y~x, method = "loess", span = 0.25, se = FALSE)
```

---

### F10: The share of wealth owned by the global 0.1% and billionaires, 2021 - pivot + fit curve

```{r data-f10, cash = TRUE}
df_f10 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F10"); df_f10
```

---

```{r}
df_f10 %>% 
  select(year, "Global Billionaire Wealth" = bn_hhweal, "Top 0.01%" = top0.1_hhweal) %>%
  pivot_longer(!year, names_to = "group",".value", values_to = "value")
```

---

```{r}
df_f10 %>% 
  select(year, "Global Billionaire Wealth" = bn_hhweal, "Top 0.01%" = top0.1_hhweal) %>%
  pivot_longer(!year, names_to = "group",".value", values_to = "value") %>%
  ggplot() +
  stat_smooth(aes(x = year, y = value, color = group), formula = y~x, method = "loess", span = 0.25, se = FALSE)
```

---

### F6: Global income inequality: Between vs. Within country inequality (Theil index), 1820-2020 - pivot + area

```{r data-f6, cash = TRUE}
df_f6 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F6"); df_f6
```

---

```{r eval =FALSE}
df_f6 %>% select(year = "...1", 2:3) %>%
  pivot_longer(cols = 2:3, names_to = "type", values_to = "value") %>%
  mutate(types = factor(type, 
      levels = c("Within-country inequality", "Between-country inequality"))) %>%
  ggplot(aes(x = year, y = value, fill = types)) +
  geom_area() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = round(seq(1820, 2020, by = 20),1)) + 
  scale_fill_manual(values=rev(scales::hue_pal()(2)), 
      labels = function(x) str_wrap(x, width = 15)) +
  labs(title = "Figure 6. Global income inequality: 
       \nBetween vs. within country inequality (Theil index), 1820-2020",
       x = "", y = "Share of global inequality (% of total Theil index)", fill = "") + 
  annotate("text", x = 1850, y = 0.28, 
      label = stringr::str_wrap("1820: Between country inequality represents 11% 
                                of global inequality", width = 20), size = 3) + 
  annotate("text", x = 1980, y = 0.70, 
      label = stringr::str_wrap("1980: Between country inequality represents 57% 
                                of global inequality", width = 20), size = 3) +
  annotate("text", x = 1990, y = 0.30, 
      label = stringr::str_wrap("2020: Between country inequality represents 32% 
                                of global inequality", width = 20), size = 3)
```

---


```{r echo =FALSE}
df_f6 %>% select(year = "...1", 2:3) %>%
  pivot_longer(cols = 2:3, names_to = "type", values_to = "value") %>%
  mutate(types = factor(type, 
      levels = c("Within-country inequality", "Between-country inequality"))) %>%
  ggplot(aes(x = year, y = value, fill = types)) +
  geom_area() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(breaks = round(seq(1820, 2020, by = 20),1)) + 
  scale_fill_manual(values=rev(scales::hue_pal()(2)), 
      labels = function(x) str_wrap(x, width = 15)) +
  labs(title = "Figure 6. Global income inequality: 
       \nBetween vs. within country inequality (Theil index), 1820-2020",
       x = "", y = "Share of global inequality (% of total Theil index)", fill = "") + 
  annotate("text", x = 1850, y = 0.28, 
      label = stringr::str_wrap("1820: Between country inequality represents 11% 
                                of global inequality", width = 20), size = 3) + 
  annotate("text", x = 1980, y = 0.70, 
      label = stringr::str_wrap("1980: Between country inequality represents 57% 
                                of global inequality", width = 20), size = 3) +
  annotate("text", x = 1990, y = 0.30, 
      label = stringr::str_wrap("2020: Between country inequality represents 32% 
                                of global inequality", width = 20), size = 3)
```


---

### F11: Top 1% vs bottom 50% wealth shares in Western Europe and the US, 1910-2020 - pivot name_sep + fit curve

```{r data-f11, cash = TRUE}
df_f11 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F11"); df_f11
```

---

```{r eval = FALSE}
df_f11 %>% 
  rename(!year, US_bot50 = USbot50, US_top1 = UStop1, 
         EU_bot50 = EUbot50, EU_top1 = EUtop1) %>%
  pivot_longer(!year, names_to = c("group",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") %>%
  ggplot() +
  stat_smooth(aes(x = year, y = value, color = group, linetype = type), 
              span = 0.25, se = FALSE) +
  scale_x_continuous(breaks = round(seq(1910, 2020, by = 10),1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 11. Top 1% vs bottom 50% wealth shares 
       \n in Western Europe and the US, 1910-2020", 
       x = "", y = "Share of total personal wealth (%)", color = "", linetype = "") +
  scale_linetype_manual(values = c("dotted","solid")) +
  annotate("text", x = 2000, y = 0.50, 
      label = stringr::str_wrap("Wealth inequality has been rising at 
        different speeds after a historical decline. The bottom 50% has always been 
                                extremely low.", width = 30), size = 3)
```


---

#### Step 1.

```{r}
df_f11 %>% rename(!year, US_bot50 = USbot50, US_top1 = UStop1, 
                  EU_bot50 = EUbot50, EU_top1 = EUtop1) 
```

---

#### Step 2. 

```{r warning = FALSE, eval=FALSE}
df_f11 %>% 
  rename(!year, US_bot50 = USbot50, US_top1 = UStop1, 
         EU_bot50 = EUbot50, EU_top1 = EUtop1) %>%
  pivot_longer(!year, names_to = c("group",".value"), names_sep = "_")
```

---

#### Step 2. 

```{r warning = FALSE, echo=FALSE}
df_f11 %>% 
  rename(!year, US_bot50 = USbot50, US_top1 = UStop1, EU_bot50 = EUbot50, EU_top1 = EUtop1) %>%
  pivot_longer(!year, names_to = c("group",".value"), names_sep = "_")
```

---

#### Step 3.

```{r warning = FALSE, eval=FALSE}
df_f11 %>% 
  rename(!year, US_bot50 = USbot50, US_top1 = UStop1, 
         EU_bot50 = EUbot50, EU_top1 = EUtop1) %>%
  pivot_longer(!year, names_to = c("group",".value"), 
               names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") 
```

---

#### Step 3.

```{r warning = FALSE, echo=FALSE}
df_f11 %>% 
  rename(!year, US_bot50 = USbot50, US_top1 = UStop1, EU_bot50 = EUbot50, EU_top1 = EUtop1) %>%
  pivot_longer(!year, names_to = c("group",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") 
```

---

```{r warning = FALSE, echo=FALSE}
df_f11 %>% 
  rename(!year, US_bot50 = USbot50, US_top1 = UStop1, EU_bot50 = EUbot50, EU_top1 = EUtop1) %>%
  pivot_longer(!year, names_to = c("group",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") %>%
  ggplot() +
  stat_smooth(aes(x = year, y = value, color = group, linetype = type), formula = y ~ x, method = "loess", span = 0.25, se = FALSE) +
  scale_x_continuous(breaks = round(seq(1910, 2020, by = 10),1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 11. Top 1% vs bottom 50% wealth shares \n in Western Europe and the US, 1910-2020", 
       x = "", y = "Share of total personal wealth (%)", color = "", linetype = "") +
  scale_linetype_manual(values = c("dotted","solid")) +
  annotate("text", x = 2000, y = 0.50, label = stringr::str_wrap("Wealth inequality has been rising at different speeds after a historical decline. The bottom 50% has always been extremely low.", width = 30), size = 3)
```

---

### F8: The rise of private versus the decline of public wealth in rich countries, 1970-2020 - rename + pivot + pivot + fit curve

```{r data-f8, cash = TRUE}
df_f8 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F8"); df_f8
```

---

```{r warning=FALSE, eval = FALSE}
df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") %>%
  ggplot() +
  stat_smooth(aes(x = year, y = value, color = country, linetype = type), 
              span = 0.25, se = FALSE, size=0.75) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 8. The rise of private versus the decline of public 
       wealth in rich countries, 1970-2020", 
       x = "", y = "wealth as as % of national income", color = "", type = "")
```

---

#### Step 1

```{r warning=FALSE, eval=FALSE}
df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') 
```


---


```{r warning=FALSE, echo=FALSE}
df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)')
```




---

#### Step 2.

```{r warning=FALSE, eval=FALSE}
df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") 
```


---

```{r warning=FALSE, echo=FALSE}
df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_")
```

---

#### Step 3.

```{r warning=FALSE, eval=FALSE}
df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value")
```

---


```{r warning=FALSE, echo=FALSE}
df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") 
```

---

#### Step 3. Final Step

```{r eval=FALSE, warning=FALSE}
df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") %>%
  ggplot() +
  stat_smooth(aes(x = year, y = value, color = country, linetype = type), 
              formula = y~x, method = "loess", span = 0.25, se = FALSE, size=0.75) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 8. The rise of private versus the decline of public wealth 
       \nin rich countries, 1970-2020", 
       x = "", y = "wealth as as % of national income", color = "", type = "")
```


---

```{r echo=FALSE, warning=FALSE}
df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") %>%
  ggplot() +
  stat_smooth(aes(x = year, y = value, color = country, linetype = type), 
              formula = y~x, method = "loess", span = 0.25, se = FALSE, size=0.75) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 8. The rise of private versus the decline of public wealth 
       \nin rich countries, 1970-2020", 
       x = "", y = "wealth as as % of national income", color = "", type = "")
```

---

### F15: Per capita emissions acriss the world, 2019 - add row names + dodge

```{r data-f15, cash = TRUE}
df_f15 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F15"); df_f15
```

---

```{r eval = FALSE}
df_f15 %>% mutate(region = rep(regionWID[!is.na(regionWID)], each = 3)) %>%
  select(region, group, tcap) %>%
  ggplot(aes(x = region, y = tcap, fill = group)) +
  geom_col(position = "dodge") + 
  scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 10)) +
  labs(title = "Figure 15 Per capita emissions across the world, 2019", 
       x = "", y = "tonnes of CO2e per person per year", fill = "")
```


---

```{r echo = FALSE}
df_f15 %>% mutate(region = rep(regionWID[!is.na(regionWID)], each = 3)) %>%
  select(region, group, tcap) %>%
  ggplot(aes(x = region, y = tcap, fill = group)) +
  geom_col(position = "dodge") + 
  scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 10)) +
  labs(title = "Figure 15 Per capita emissions across the world, 2019", 
       x = "", y = "tonnes of CO2e per person per year", fill = "")
```


## EDA Workflow

### EDA Step 0

1. Choose and clarify a topic to study.
2. List questions to study
3. Find data:
  - link to data with a url: universal resource locator in a webpage
  - download data in csv, Excel, etc.

Repeat the process during your EDA.


![image](data/data-science.png)

### EDA by R Studio: Step 1

In RStudio,

1.1. Project

  * Create a new project: File > New Project; or  
  * Open a project: File > Open Project, Open Project in New Session, Open Recent Project  
  * _Check there is a file `project_name.Rproj` in your project folder (directory)_

 
1.2. data folder (directory) `data`

  * Create a data folder: Press New Folder at the right bottom pane; or 
  * Confirm the data folder previously created: Press Files at the right bottom pane
  * _If you follow 1, the data folder exists in your project folder_

 
1.3. Move (or copy) data for the project to the data folder

  * If you downloaded the data, it is in your Download folder. Move it to `data`.
  * _Check in your RStudio that your data is in `data`: Press Files at the right bottom pane and click `data`, the data folder._
  

---

### EDA by R Studio: Step 2

2.1. Project Notebook: Memo

  - Create an R Notebook: File > New File > R Notebook
  
    + You can use R Notebook template in Moodle by moving the template (template.Rmd or template.nb.Rmd) file in your project folder or copy and paste the text file into your new R Notebook.
    + If you use template.nb.Rmd (R Notebook File), choose Open in Editor.
    
  - Add descriptive title. 
  
2.2. Setup Code Chunk 

  - Create a code chunk and add packages to use in the project and RUN the code.
  
      + library(tidyverse)
      + library(WDI)
      + or any other packages

---

2.3. Choose `Source` or `Visual` editor mode, and start editing Project Notebook

  - Set up Headings such as: About, Data, Analysis and Visualizations, Conclusions
  - Under About or Data, paste url of the sites and/or the data

      + eg. World Development Indicator:
      https://datatopics.worldbank.org/world-development-indicators/)
      + eg. Public expenditure on education:
      https://data.un.org/_Docs/SYB/CSV/SYB65_245_202209_Public%
      20expenditure%20on%20education.csv)


2.4. Edit a new file by saving as for a report

  - File > Save As...

---

### EDA by R Studio: Step 3 - Importing Data

Assign a name you can recall easily when you import data. You may need to reload the data with options.

3.1. Use a package:

  * WDI, wir, eurostat, etc/
  * `wdi_shortname <- WDI(indicator = "indicator's name", ... )
  * Store the data and use it: `write_csv(wdi_shortname, "data/wdi_shortname.csv")`
  * `wdi_shortname <- read_csv("data/wdi_shortname.csv")`
  
3.2. Use `readr` to read from `data`, your data folder

  * `df1_shortname <- read_csv("data/file_name.csv")`

---

3.3. Use `readr` to read using the url of the data

  * `df2_shortname <- read_csv("url_of_the_data")`
  * Store the data and use it: `write_csv(df2_shortname, "data/df2_shortname.csv")`
  * `df2_shortname <- read_csv("data/df2_shortname.csv")`
  
3.5. Use `readxl` to read Excel data. Add `library(readxl)` in the setup and run.

  * `df4 <- read_excel("data/file_name.xlsx", sheet = 1)`
  
References: Cheat Sheet - `readr`, [readr](https://readr.tidyverse.org), [readxl](https://readxl.tidyverse.org)


---

### EDA by R Studio: Step 4 - Data Trasnformation

4.1. Look at the data: suppose `df` is the data frame

  * It is a good option to change into a tibble: `dt <- as_tibble(df)`
  * `head(df)`, `str(df)`, `summary(df)`, `dt`, `glimpse(dt)`

4.2. Look at each variable

  * categorical? numerical? 
  * factor? - [forcats](https://forcats.tidyverse.org)
  
4.3. Variation of each data: suppose `x1` is a column name.

  * `df %>% ggplot() + geom_histogram(aes(x1), bins = 30)`
  * `df %>% drop_na(x1)`: see the rows with a value in `x1`. If the value is NA, the row is not shown.
  
    - `df_wo_na <- df %>% drop_na(x1)` if you want to use only the rows without NA in `x1`
    
---

4.4. Use `dpylr` and `tidyr` to change column names, tidy data, and/or summarize data

  * `rename`, `select`, `filter`, `arrange`, `mutate`, `pivot_longer()`, `pivot_wider()`, `group_by` and `summarize`


References: Cheat Sheet - `dplyr` and `tidyr`, [dplyr](https://dplyr.tidyverse.org), [tidyr](https://tidyr.tidyverse.org)

---

### EDA by R Studio: Step 5 - Visualize Data

5.1. In combination with Stap 4 - data transformation, try various data visualization.

  * What type of variation occurs within my variables?
  * What type of covariation occurs between my variables?


5.2. Keep a record of what you can observe by the visualization

5.3. Edit the list of questions by adding or polishing

5.4. Select several informative chart and add options

5.5. Look at examples from the textbooks or teaching site to have better visualization

References: Cheat Sheet - `ggplot2` [ggplot2](https://ggplot2.tidyverse.org), [ggplot2 book](https://ggplot2-book.org)

---

### EDA by R Studio: Step 6 - Conclusions and Questions for Further Study

1. EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

2. Generate questions about your data

3. Search for answers by visualising, transforming, and/or modeling your data

Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.

---

### Example: WDI

* Government expenditure on education, total (% of GDP)

  - https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS
  
* ID: SE.XPD.TOTL.GD.ZS

---

### Example: WIR2022

```{r warning=FALSE, echo=FALSE}
df_f8 %>% 
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)', 
         Spain_public = Spain, Spain_private = 'Spain (private)', 
         France_public = France, France_private = 'France (private)', 
         UK_public  = UK, UK_private = 'UK (private)', 
         Japan_public = Japan, Japan_private = 'Japan (private)', 
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") %>%
  ggplot() +
  stat_smooth(aes(x = year, y = value, color = country, linetype = type), formula = y~x, method = "loess", span = 0.25, se = FALSE, size=0.75) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 8. The rise of private versus the decline of public wealth \nin rich countries, 1970-2020", 
       x = "", y = "wealth as as % of national income", color = "", type = "")
```

## The Week Five Assignment (in Moodle)

**`tidyr` and WIR2022**

* Create an R Notebook of a Data Analysis containing the following and submit the rendered HTML file (eg. `a3_123456.nb.html`  by replacing 123456 with your ID)
  1. create an R Notebook using the R Notebook Template in Moodle,  save as `a3_123456.Rmd`, 
  2. write your name and ID and the contents, 
  3. run each code block, 
  4. preview to create `a3_123456.nb.html`,
  5. submit  `a3_123456.nb.html` to Moodle.

1. Choose a data with at least two categorical variables and at least two numerical variables.

    - Information of the data: Name, Indicator, Description, Source, etc.
    - Explain why you chose the indicator
    - List questions you want to study

---

2. Explore the data using visualization using `ggplot2`

    - Create various charts
    - Create at least one chart with at least two categorical variables and at least one numerical variable.
    - Create at least one chart with at least two numerical variables and at least one categorical variable.

3. Observations based on your data visualization, and difficulties and questions encountered if any.

**Due:** 2023-01-23 23:59:00. Submit your R Notebook file in Moodle (The Fourth Assignment). Due on Monday!


# Exploratory Data Analysis (EDA) V  

## Modeling

> “Tidy data sets are all alike; but every messy data set is messy in its own way.” — Hadley Wickham

> “all happy families are all alike; each unhappy family is unhappy in its own way” - Tolstoy's Anna Karenina


Correlation

```{r}
iris %>% select(-5) %>% cor()
```

Galton's data. Regression.


Normalization, standalization

SDGs Academy

Trend


```{r}
df4 <- WDI(
  country="all",
  indicator = c("SP.POP.TOTL","MS.MIL.XPND.GD.ZS","MS.MIL.TOTL.P1","NY.GDP.MKTP.CD"),
  start = 1960,
  end = NULL,
  extra = FALSE,
  cache = NULL,
  latest = NULL,
  language = "en"
)
```

```{r}
colnames(df4) <- c("Country","iso2c","iso3c","Year","Population","Mil_Exp","Total_Mil_HC","GDP")
df4
```

### Tali

```{r}
df_wdi_poverty <- WDI(
  country = "all", 
  indicator = c(national_poverty_rate = "SI.POV.NAHC", multidimentional_poverty_rate = "SI.POV.MDIM", gdpPercap = "NY.GDP.PCAP.KD", gini_indx = "SI.POV.GINI"), start = 1990,
    end = 2021,
  extra = TRUE
)
```

```{r}
df_wdi_poverty %>%  
  group_by(country, year) %>%
  mutate(mean_gdp_country = mean(gdpPercap)) %>%
  mutate(mean_poverty_country= mean(national_poverty_rate)) %>%
  ungroup() %>%
    filter(!is.na(income)) %>%   filter(!(income=="Aggregates")) %>% ggplot(aes(x = log10(mean_gdp_country))) + geom_point(aes(y = mean_poverty_country, color = income)) +  labs(x = "GDP per capita", y = "poverty rate (% of population)", title = "Poverty rates and GDP per capita", subtitle="world countries, 1990-2021 average, by income level")
```

```{r}
df_wdi_poverty %>%  
  group_by(country, year) %>%
  mutate(mean_gdp_country = mean(gdpPercap)) %>%
  mutate(mean_multipoverty_country= mean(multidimentional_poverty_rate)) %>%
  ungroup() %>%
    filter(!(region=="Aggregates")) %>% filter(!is.na(region)) %>% ggplot(aes(x = log10(mean_gdp_country))) + geom_point(aes(y = mean_multipoverty_country, color = region)) +  labs(x = "GDP per capita", y = "Multidimentinal poverty rate (% of population)", title = "Multidimentional Poverty rates and GDP per capita", subtitle="world countries, 1990-2021 average, by region")
```

index sdg, ghi, academy

moocs

R4DS: Model basics
  https://r4ds.had.co.nz/model-basics.html
modelr: https://modelr.tidyverse.org

Tidymodels: https://www.tidymodels.org

Tidyverse Skills for Data Science
https://jhudatascience.org/tidyversecourse/
https://jhudatascience.org/tidyversecourse/model.html


machine learning:A Gentle Introduction to tidymodels https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/

Tidy Modeling with R
https://www.tmwr.org


### Other Examples

## Roudups

### R Markdown Revisited

#### Literate Programming and Reproducible Research

Importing Data: 

1. Read a csv file: `read_csv("data/file_name.csv")`
2. Download and import using a url of a csv file: `read_csv(url)`
3. Read an Excel file: `readxl::read_excel("data/excel_file_name.xlsx")`
4. Read from the clipboard: `read_delim(clipboard())`

  - Not reproducible unless clearly explained.
  
#### Code Chunk Options

https://yihui.org/knitr/options/

* Chunk Name
* Output: use document default
  - Show code and output: echo=TRUE, eval=TRUE - Default
  - Show output only: echo=FALSE
  - Show nothing (run code): include=FALSE
  - Show nothing (don't run code): include=FALSE, eval=FALSE
* Show message: message=TRUE, FALSE
* Show warning: warning=TRUE, FALSE
* Use Paged Tables: paged.print=TRUE, FALSE
* Use custom figure size: width and height in inch.

#### Presentation and Paper

1. Data Source
2. Variables
3. Problems
4. Visualization
5. Model
6. Conclusions and Further Research
 
   WDI, WIR, etc

#### Word

Custom Word templates: https://bookdown.org/yihui/rmarkdown-cookbook/word-template.html

You can apply the styles defined in a Word template document to new Word documents generated from R Markdown. Such a template document is also called a “style reference document.” The key is that you have to create this template document from Pandoc first, and change the style definitions in it later. Then pass the path of this template to the reference_docx option of word_document

```
---
 word_document:
    reference_docx: "template.docx"
---
```

#### PowerPoint

PowerPoint presentation: https://bookdown.org/yihui/rmarkdown/powerpoint-presentation.html

Custom templates: https://bookdown.org/yihui/rmarkdown/powerpoint-presentation.html#ppt-templates

```
---
  powerpoint_presentation:
    reference_doc: my-styles.pptx
---
```

https://support.microsoft.com/en-us/office/create-and-save-a-powerpoint-template-ee4429ad-2a74-4100-82f7-50f8169c8aca

YouTube: How To Create A PowerPoint Template